Regression and Linear Models studies how we use observed data to explain relationships, estimate effects, and make predictions. The course begins with regression as both a geometric projection problem and a statistical model for uncertainty, then develops ordinary least squares in matrix form as the central computational and conceptual tool. From there, it introduces the probabilistic assumptions behind linear regression and shows how those assumptions support estimation, inference, and interpretation.
A major theme is the connection between algebra, geometry, and probability. The course moves from the mechanics of fitting models to the theory that justifies them: the Gauss-Markov theorem, exact inference under normal errors, and large-sample methods with robust standard errors. It then broadens to practical model building, diagnostics, and influence analysis, so that fitted models can be checked, refined, and communicated responsibly.
Later chapters extend the classical linear model to settings where OLS is not enough. Regularization and high-dimensional regression address modern problems with many predictors, while generalized linear models and logistic regression expand the framework to non-Gaussian outcomes. The course closes with prediction, validation, and communication, tying together theory and practice so that regression becomes not just a fitting method, but a disciplined approach to scientific and data-driven reasoning.
# Introduction
This opening chapter sets the agenda for a course on regression and linear models. Regression is the study of how a response variable changes with covariates, but the course treats this question through probability, geometry, and inference rather than through curve fitting alone. The central object is a conditional mean, while the central computational method is least squares projection. Chapters 3--6 develop finite-sample distribution theory and asymptotic approximations, Chapters 7--8 treat modelling choices and diagnostics, and Chapters 9--10 extend the framework to regularized and generalized linear models.
## What Regression Tries to Estimate
What does it mean to predict a numerical outcome from observed information, and what target should a regression method aim at? The course begins from this decision-theoretic question because regression coefficients are not meaningful until the population quantity being estimated has been specified.
Suppose $(Y, X)$ is a random pair on a probability space $(\Omega, \mathcal F, \mathbb P)$, where $Y$ is a real-valued response and $X$ is a covariate vector taking values in a measurable space $(\mathcal X, \mathcal A)$. A prediction rule is a measurable function $g: \mathcal X \to \mathbb R$. Under squared loss, its risk is
\begin{align*}
R(g) = \mathbb E[(Y - g(X))^2],
\end{align*}
whenever the expectation is finite.
[definition: Regression Function]
Let $(Y,X)$ be a random pair with $\mathbb E[Y^2] < \infty$. A regression function of $Y$ on $X$ is a measurable function $m: \mathcal X \to \mathbb R$ such that
\begin{align*}
m(X) = \mathbb E[Y \mid X]
\end{align*}
almost surely.
[/definition]
The regression function is the population-level prediction target. A fitted regression model is useful only insofar as it estimates, approximates, or interprets this conditional mean structure. To justify that target, we need a variational statement: among all measurable prediction rules based on $X$, squared loss should single out the conditional mean.
[quotetheorem:4421]
[citeproof:4421]
This theorem explains why squared-loss regression is tied to conditional expectation. The assumption $\mathbb E[Y^2]<\infty$ is not a technical decoration: it ensures that squared risks and the orthogonal decomposition in $L^2$ are finite. If $Y$ has no second moment, for instance a heavy-tailed response independent of $X$, the expression $\mathbb E[(Y-g(X))^2]$ can be infinite for every square-integrable rule $g$, so the minimization problem no longer identifies a useful target. The theorem also separates prediction from causation: the conditional mean describes association in the joint distribution of $(Y,X)$, not the effect of intervening on $X$.
[example: Study Hours And Exam Scores]
Let $Y$ be an exam score and let $X$ be the number of hours studied in the preceding week. The regression function is
\begin{align*}
m(x)=\mathbb E[Y\mid X=x],
\end{align*}
so for each fixed study time $x$, $m(x)$ is the population average of the exam score among students whose study time equals $x$. If we restrict attention to affine predictors, we replace the full curve $m(x)$ by a function of the form
\begin{align*}
g(x)=\alpha+\beta x.
\end{align*}
For two study times $x$ and $x+1$, this affine predictor changes by
\begin{align*}
g(x+1)-g(x)
&=\{\alpha+\beta(x+1)\}-(\alpha+\beta x)\\
&=\alpha+\beta x+\beta-\alpha-\beta x\\
&=\beta.
\end{align*}
Thus $\beta$ is the one-hour change in the fitted affine predictor, not automatically the causal effect of forcing a student to study one additional hour. It summarizes the best linear association in the population model being fitted, while causal interpretation would require additional assumptions about how study time is assigned.
[/example]
## Why Linear Models Enter
If the best predictor is the conditional mean, why focus on linear functions at all? The answer is that the unrestricted conditional mean may be too complex to estimate from limited data, while linear classes are interpretable, computable, and geometrically tractable.
A linear regression model restricts prediction rules to an affine or linear span of chosen covariates. In the simplest scalar case this means approximating $m(x)$ by $\alpha+\beta x$. In multiple regression it means choosing functions $x_1,\dots,x_p$ of the observed covariates and approximating by
\begin{align*}
g_\beta(x)=\beta_0+\beta_1x_1+\dots+\beta_px_p.
\end{align*}
The variables $x_j$ may be raw measurements, transformations, interactions, or indicators for categories.
[definition: Linear Regression Function Class]
Let $\phi_0,\phi_1,\dots,\phi_p: \mathcal X \to \mathbb R$ be measurable functions with $\phi_0(x)=1$. The associated linear regression function class is
\begin{align*}
\mathcal G = \left\{g_\beta: \mathcal X \to \mathbb R \mid g_\beta(x)=\sum_{j=0}^p \beta_j\phi_j(x),\ \beta\in\mathbb R^{p+1}\right\}.
\end{align*}
[/definition]
After this restriction, the target changes from the full conditional mean to the best approximation within $\mathcal G$. This distinction between the population regression target and the sample least squares estimate is one of the recurring themes of the course. The next theorem therefore needs a rank condition: if two chosen features carry the same information, then the same fitted function can be represented by more than one coefficient vector.
[quotetheorem:4422]
[citeproof:4422]
The theorem is the population version of the normal equations used in ordinary least squares. Invertibility of $Q$ is needed to identify the coefficient vector, not merely the fitted prediction: if $\phi_1(X)=2\phi_0(X)$ almost surely, then many pairs of coefficients give the same fitted function and $Q$ is singular. The residual is not necessarily independent of the covariates and need not have mean zero conditional on $X$; it is orthogonal only to the chosen finite-dimensional feature space. Thus the theorem says how to find the best linear approximation in the selected features, but it does not say that the conditional mean is linear or that sample-based inference is valid without additional probabilistic assumptions.
[example: Misspecified Linear Approximation]
Let $X\sim \operatorname{Unif}(-1,1)$ and let $Y=X^2$. Since $Y$ is determined by $X$, the conditional mean is
\begin{align*}
m(x)=\mathbb E[Y\mid X=x]=x^2.
\end{align*}
To find the best affine population approximation, minimize
\begin{align*}
L(\alpha,\beta)
&=\mathbb E[(X^2-\alpha-\beta X)^2].
\end{align*}
Expanding the square gives
\begin{align*}
L(\alpha,\beta)
&=\mathbb E[X^4-2\alpha X^2-2\beta X^3+\alpha^2+2\alpha\beta X+\beta^2X^2]\\
&=\mathbb E[X^4]-2\alpha\mathbb E[X^2]-2\beta\mathbb E[X^3]+\alpha^2+2\alpha\beta\mathbb E[X]+\beta^2\mathbb E[X^2].
\end{align*}
For $X\sim \operatorname{Unif}(-1,1)$, its density is $1/2$ on $[-1,1]$, so
\begin{align*}
\mathbb E[X]&=\frac12\int_{-1}^1 x\,dx=0,\\
\mathbb E[X^2]&=\frac12\int_{-1}^1 x^2\,dx=\frac12\left[\frac{x^3}{3}\right]_{-1}^1=\frac13,\\
\mathbb E[X^3]&=\frac12\int_{-1}^1 x^3\,dx=0,\\
\mathbb E[X^4]&=\frac12\int_{-1}^1 x^4\,dx=\frac12\left[\frac{x^5}{5}\right]_{-1}^1=\frac15.
\end{align*}
Substituting these moments into $L(\alpha,\beta)$,
\begin{align*}
L(\alpha,\beta)
&=\frac15-\frac{2\alpha}{3}+\alpha^2+\frac{\beta^2}{3}\\
&=\left(\alpha-\frac13\right)^2+\frac{\beta^2}{3}+\frac15-\frac19.
\end{align*}
The last expression is minimized when $\alpha=1/3$ and $\beta=0$. Thus the best affine population predictor is the constant function $g(x)=1/3$, even though the true conditional mean $m(x)=x^2$ is curved.
[/example]
## From Population Targets to Data
How does the population projection become something computed from observations? The statistical step is to replace expectations by sample averages, producing ordinary least squares. This replacement is necessary because the population law of $(Y,X)$, and hence the expectations in the projection formula, are usually unknown. It also introduces a new possible failure: the sample features can be linearly dependent even when the population target is well-defined.
Let $(Y_i,X_i)_{1\le i\le n}$ be observations, and let $x_i=(1,x_{i1},\dots,x_{ip})^\top\in\mathbb R^{p+1}$ denote the feature vector for observation $i$. Writing $y=(Y_1,\dots,Y_n)^\top$ and placing the row vectors $x_i^\top$ into a design matrix $X\in\mathbb R^{n\times(p+1)}$, the fitted linear model has the form
\begin{align*}
y = X\beta + e,
\end{align*}
where $e$ is the vector of residuals after fitting.
[definition: Ordinary Least Squares Estimator]
Let $X\in\mathbb R^{n\times p}$ be a design matrix and let $y\in\mathbb R^n$ be a response vector. An ordinary least squares estimator is any vector $\hat\beta\in\mathbb R^p$ satisfying
\begin{align*}
\hat\beta \in \operatorname*{argmin}_{\beta\in\mathbb R^p} |y-X\beta|^2,
\end{align*}
where $|\cdot|$ is the Euclidean norm on $\mathbb R^n$.
[/definition]
When $X$ has full column rank, the minimizer is unique and is given by $\hat\beta=(X^\top X)^{-1}X^\top y$. When $X$ is rank-deficient, the fitted values may still be unique even though the coefficient vector is not. For example, duplicate columns in $X$ mean that shifting coefficient mass from one duplicate column to the other leaves $X\beta$ unchanged, so least squares identifies the fitted vector but not the individual duplicated coefficients.
[example: Intercept And Centering]
Suppose the fitted model has an intercept and one numerical covariate, so the least squares criterion is
\begin{align*}
S(a,b)=\sum_{i=1}^n (Y_i-a-bx_i)^2.
\end{align*}
Assume the covariate is not constant, so $\sum_{i=1}^n (x_i-\bar x)^2>0$ and the least squares slope is identified. At a minimizer $(\hat a,\hat b)$, differentiating $S(a,b)$ with respect to $a$ gives
\begin{align*}
0
&=\frac{\partial S}{\partial a}(\hat a,\hat b)\\
&=\sum_{i=1}^n 2(Y_i-\hat a-\hat b x_i)(-1)\\
&=-2\sum_{i=1}^n (Y_i-\hat a-\hat b x_i).
\end{align*}
Therefore
\begin{align*}
0
&=\sum_{i=1}^n (Y_i-\hat a-\hat b x_i)\\
&=\sum_{i=1}^n Y_i-\sum_{i=1}^n \hat a-\sum_{i=1}^n \hat b x_i\\
&=n\bar y-n\hat a-\hat b n\bar x,
\end{align*}
so
\begin{align*}
\hat a=\bar y-\hat b\bar x.
\end{align*}
The fitted line $\hat g(x)=\hat a+\hat b x$ therefore satisfies
\begin{align*}
\hat g(\bar x)
&=\hat a+\hat b\bar x\\
&=(\bar y-\hat b\bar x)+\hat b\bar x\\
&=\bar y.
\end{align*}
Thus the least squares line passes through $(\bar x,\bar y)$.
Now define the centered covariate $z_i=x_i-\bar x$ and fit
\begin{align*}
Y_i=c+dz_i+e_i.
\end{align*}
Since $x_i=z_i+\bar x$, any fitted value from the original parametrization can be rewritten as
\begin{align*}
a+bx_i
&=a+b(z_i+\bar x)\\
&=(a+b\bar x)+bz_i.
\end{align*}
Thus the same fitted values are obtained with
\begin{align*}
c=a+b\bar x,\qquad d=b.
\end{align*}
For the least squares solution, this gives
\begin{align*}
\hat c
&=\hat a+\hat b\bar x\\
&=(\bar y-\hat b\bar x)+\hat b\bar x\\
&=\bar y,
\end{align*}
and $\hat d=\hat b$. Centering therefore changes the intercept from the fitted value at $x=0$ to the fitted value at $x=\bar x$, while leaving the slope and every fitted value unchanged.
[/example]
## Geometry, Probability, and Inference
What extra structure is needed to attach standard errors, confidence intervals, and tests to a least squares fit? The geometry of projection gives the algebra of the fit, while probability assumptions determine the distribution of the estimator.
The first part of the course studies the projection geometry of least squares: column spaces, hat matrices, residuals, rank, and decompositions of sums of squares. The second part adds probabilistic assumptions such as independent errors, constant variance, and sometimes normality. Under these assumptions, the estimator has a sampling distribution that can be used for inference about coefficients and predictions.
[definition: Linear Model With Additive Error]
A linear model with additive error consists of observations satisfying
\begin{align*}
Y_i = x_i^\top\beta + \varepsilon_i, \qquad 1\le i\le n,
\end{align*}
where $x_i\in\mathbb R^p$ are covariate vectors, $\beta\in\mathbb R^p$ is an unknown parameter, and $\varepsilon_i$ are real-valued random errors.
[/definition]
The phrase "linear" refers to linearity in the unknown coefficients, not necessarily linearity in the raw covariates. Polynomial regression, interactions, and categorical regressors are linear models once the feature vector has been chosen. The next issue is inferential rather than notational: under the additive-error model, we need conditions that make the least squares coefficient target unbiased for the unknown parameter.
[quotetheorem:4423]
[citeproof:4423]
This result is a first example of how assumptions translate into inferential claims. If $\mathbb E[\varepsilon]=a$ with $X^\top a\ne 0$, then the same calculation gives bias $(X^\top X)^{-1}X^\top a$, so a systematic error component aligned with the design is inherited by the estimator. If $X$ is not full rank, the inverse does not exist and the coefficient vector is not identified without an additional convention or constraint. In random-design settings, the corresponding assumption is usually conditional exogeneity, $\mathbb E[\varepsilon\mid X]=0$, which is stronger than merely requiring the unconditional mean $\mathbb E[\varepsilon]=0$. Later chapters ask what happens when errors are heteroskedastic, correlated, non-normal, or when the linear mean model itself is misspecified.
## Roadmap of the Course
What will the course build after this introduction? The sequence is designed to move from prediction targets to estimation formulas, then from formulas to inference and model criticism.
The next chapter treats regression as prediction and projection in $L^2$, making the conditional mean and best linear predictor precise. Ordinary least squares is then developed in matrix form, including fitted values, residuals, rank conditions, and the geometry of orthogonal projection. The course then studies the Gauss--Markov theorem, normal linear models, confidence intervals, hypothesis tests, model comparison, diagnostics, transformations, and asymptotic theory. Chapter 9 introduces regularization for unstable or high-dimensional linear fits, Chapter 10 extends the linear-model viewpoint to non-Gaussian responses with logistic regression as the main example, and Chapter 11 closes with prediction, validation, and communication.
[remark: Three Levels Of Regression]
Throughout the course it is useful to keep three levels separate. The population level concerns $\mathbb E[Y\mid X]$ and the best approximation within a chosen function class. The sample level concerns the computed estimator $\hat\beta$ and fitted values $X\hat\beta$. The inferential level concerns the probability distribution of $\hat\beta$ under assumptions on the data-generating mechanism.
[/remark]
The opening chapter separated three levels of regression: the population target, the sample estimator, and the distribution used for inference. That distinction now becomes concrete as we move from abstract statements about $E[Y\mid X]$ to the question of how least squares turns covariates into a prediction rule.
# 1. Regression as Prediction and Projection
Regression begins with a prediction question. We observe a response variable whose future value we would like to predict, together with covariates that carry information about it. The first point of the course is that least squares is not merely an algebraic fitting rule: at the population level it is an orthogonal projection problem in an $L^2$ space. Building on Chapter 0's distinction between population, sample, and inferential levels, this chapter develops the probabilistic target of regression before Chapter 2 introduces the finite-sample matrix machinery.
## Prediction With Squared Loss
What should it mean to predict a random response $Y$ using information contained in covariates $X$? The answer depends on the loss function. In this course, the basic loss is squared error, so a good prediction is one with small mean squared discrepancy from $Y$.
Let $(\Omega, \mathcal F, \mathbb P)$ be a probability space. A response is a real-valued random variable $Y$, and a covariate vector is a random vector $X: \Omega \to \mathbb R^p$. We assume $\mathbb E[Y^2] < \infty$ throughout this chapter, and when covariates are used in squared-loss formulas we assume the relevant second moments are finite.
[definition: Squared Prediction Risk]
Let $Y \in L^2(\Omega, \mathcal F, \mathbb P)$ and let $X: \Omega \to \mathbb R^p$ be a random vector. Define
\begin{align*}
\mathcal G_X=\{g:\mathbb R^p\to\mathbb R : g \text{ is measurable and } g(X)\in L^2(\Omega,\mathcal F,\mathbb P)\}.
\end{align*}
The squared prediction risk for predicting $Y$ from $X$ is the map $R:\mathcal G_X\to\mathbb R$ given by
\begin{align*}
R(g) = \mathbb E[(Y - g(X))^2].
\end{align*}
[/definition]
The function $g$ is a rule that turns observed covariates into a predicted response. The risk is a population quantity: it averages over the joint distribution of $(X,Y)$, not over a particular data set.
[example: Conditional Information Can Beat The Global Mean]
Suppose $X$ takes the values $0$ and $10$ with probability $1/2$ each. Given $X=0$, assume $\mathbb E[Y\mid X=0]=60$ and $\operatorname{Var}(Y\mid X=0)=25$; given $X=10$, assume $\mathbb E[Y\mid X=10]=80$ and $\operatorname{Var}(Y\mid X=10)=25$. The marginal mean is
\begin{align*}
\mathbb E[Y]
&=\mathbb E[\mathbb E[Y\mid X]]\\
&=\frac12\cdot 60+\frac12\cdot 80\\
&=70.
\end{align*}
For the constant predictor $g(x)=70$, condition on the two possible values of $X$:
\begin{align*}
R(g)
&=\mathbb E[(Y-70)^2]\\
&=\frac12\mathbb E[(Y-70)^2\mid X=0]
+\frac12\mathbb E[(Y-70)^2\mid X=10].
\end{align*}
When $X=0$,
\begin{align*}
\mathbb E[(Y-70)^2\mid X=0]
&=\mathbb E[((Y-60)+(60-70))^2\mid X=0]\\
&=\mathbb E[(Y-60)^2\mid X=0]
+2(60-70)\mathbb E[Y-60\mid X=0]
+(60-70)^2\\
&=25+2(-10)(60-60)+100\\
&=125.
\end{align*}
Similarly, when $X=10$,
\begin{align*}
\mathbb E[(Y-70)^2\mid X=10]
&=\mathbb E[((Y-80)+(80-70))^2\mid X=10]\\
&=25+2(10)(80-80)+100\\
&=125.
\end{align*}
Therefore
\begin{align*}
R(g)
&=\frac12\cdot 125+\frac12\cdot 125\\
&=125.
\end{align*}
Now use the covariate-aware rule $h(0)=60$ and $h(10)=80$. Since $h(X)=60$ on $\{X=0\}$ and $h(X)=80$ on $\{X=10\}$,
\begin{align*}
R(h)
&=\mathbb E[(Y-h(X))^2]\\
&=\frac12\mathbb E[(Y-60)^2\mid X=0]
+\frac12\mathbb E[(Y-80)^2\mid X=10]\\
&=\frac12\cdot 25+\frac12\cdot 25\\
&=25.
\end{align*}
Using $X$ removes the between-group error $100$ paid by the global mean predictor in each group, leaving only the within-group variance $25$.
[/example]
The statistical learning problem adds data. We do not know the population distribution of $(X,Y)$, but we observe repeated draws from it and use them to construct a prediction rule.
[definition: Random Sample]
A random sample of size $n$ from the joint distribution of $(X,Y)$ is a collection
\begin{align*}
(X_1,Y_1),\dots,(X_n,Y_n)
\end{align*}
of independent random vectors with the same distribution as $(X,Y)$.
[/definition]
The sample is the information available to the statistician; the risk is the population criterion we would like to make small. Much of regression theory studies how well a sample-chosen rule approximates the population rule that would be chosen if the joint law were known.
[definition: Regression Function]
Let $Y \in L^2(\Omega, \mathcal F, \mathbb P)$ and let $X: \Omega \to \mathbb R^p$ be a random vector. The conditional mean regression function is any measurable function $m: \mathbb R^p \to \mathbb R$ satisfying
\begin{align*}
m(X)=\mathbb E[Y \mid X]
\end{align*}
a.s.
[/definition]
The notation $\mathbb E[Y \mid X]$ means $\mathbb E[Y \mid \sigma(X)]$. The conditional mean is defined up to changes on sets of $X$-probability zero, which is enough for prediction because $g(X)$ is unchanged a.s. by such modifications. What remains is to connect this conditional-expectation object to the optimization problem: squared-loss regression should choose it because it minimizes population risk over all $X$-measurable predictors.
[quotetheorem:4421]
[citeproof:4421]
This theorem says that squared loss makes the conditional mean the population target. The assumption $Y\in L^2$ is not cosmetic: it ensures that risks and orthogonality statements are finite, so that the projection calculation takes place inside a Hilbert space. The theorem is also specific to squared loss. Under absolute loss, the corresponding population target is a conditional median rather than a conditional mean.
[example: Squared Loss And Absolute Loss Have Different Targets]
Let $X$ be constant, so a prediction rule is just a constant $a\in\mathbb R$. For $Y\sim\operatorname{Exp}(1)$, the density is $f(y)=e^{-y}$ for $y\ge 0$. The squared-loss risk is
\begin{align*}
\mathbb E[(Y-a)^2]
&=\mathbb E[Y^2]-2a\mathbb E[Y]+a^2.
\end{align*}
Using integration by parts,
\begin{align*}
\mathbb E[Y]
&=\int_0^\infty y e^{-y}\,dy
=\left[-ye^{-y}\right]_0^\infty+\int_0^\infty e^{-y}\,dy
=0+\left[-e^{-y}\right]_0^\infty
=1,
\end{align*}
and
\begin{align*}
\mathbb E[Y^2]
&=\int_0^\infty y^2 e^{-y}\,dy
=\left[-y^2e^{-y}\right]_0^\infty+2\int_0^\infty y e^{-y}\,dy
=0+2
=2.
\end{align*}
Therefore
\begin{align*}
\mathbb E[(Y-a)^2]
&=2-2a+a^2\\
&=(a-1)^2+1,
\end{align*}
so the unique squared-loss optimal constant predictor is $a=1=\mathbb E[Y]$.
For absolute loss, if $a<0$, then $Y\ge 0$ implies $|Y-a|=Y-a$, so
\begin{align*}
\mathbb E[|Y-a|]
&=\mathbb E[Y]-a\\
&=1-a\\
&>1
=\mathbb E[|Y-0|].
\end{align*}
Thus no negative $a$ can minimize absolute loss. For $a\ge 0$,
\begin{align*}
\mathbb E[|Y-a|]
&=\int_0^a (a-y)e^{-y}\,dy+\int_a^\infty (y-a)e^{-y}\,dy\\
&=a\int_0^a e^{-y}\,dy-\int_0^a ye^{-y}\,dy+\int_a^\infty ye^{-y}\,dy-a\int_a^\infty e^{-y}\,dy.
\end{align*}
The four pieces are
\begin{align*}
a\int_0^a e^{-y}\,dy&=a(1-e^{-a}),\\
\int_0^a ye^{-y}\,dy&=\left[-(y+1)e^{-y}\right]_0^a=1-(a+1)e^{-a},\\
\int_a^\infty ye^{-y}\,dy&=\left[-(y+1)e^{-y}\right]_a^\infty=(a+1)e^{-a},\\
a\int_a^\infty e^{-y}\,dy&=ae^{-a}.
\end{align*}
Substituting gives
\begin{align*}
\mathbb E[|Y-a|]
&=a(1-e^{-a})-\{1-(a+1)e^{-a}\}+(a+1)e^{-a}-ae^{-a}\\
&=a-1+2e^{-a}.
\end{align*}
Differentiating on $[0,\infty)$,
\begin{align*}
\frac{d}{da}\{a-1+2e^{-a}\}
&=1-2e^{-a},
\end{align*}
so the critical point satisfies
\begin{align*}
1-2e^{-a}&=0,\\
e^{-a}&=\frac12,\\
a&=\log 2.
\end{align*}
Since
\begin{align*}
\frac{d^2}{da^2}\{a-1+2e^{-a}\}
&=2e^{-a}>0,
\end{align*}
this critical point is the unique absolute-loss minimizer. Equivalently, it is the median, because
\begin{align*}
\mathbb P(Y\le \log 2)
&=1-e^{-\log 2}
=1-\frac12
=\frac12.
\end{align*}
Thus squared loss targets the mean $1$, while absolute loss targets the median $\log 2$; changing the loss changes the population target even with the same distribution of $Y$.
[/example]
The result also does not say how to estimate $\mathbb E[Y\mid X]$ from finitely many observations. It identifies the population object that regression would pursue if the joint law of $(X,Y)$ were known. Before imposing linear or affine restrictions, we need one more population identity that separates the unavoidable conditional noise from the extra error caused by choosing a predictor different from the conditional mean.
[quotetheorem:4425]
[citeproof:4425]
The decomposition separates irreducible noise from approximation error. Even the best possible function of $X$ cannot remove the conditional variance of $Y$ given $X$. The formula is a population identity, not the finite-sample bias-variance decomposition for an estimator trained on data; no randomness from a training algorithm appears here. The second-moment assumptions are also essential. If $Y$ has a Cauchy distribution, then $\mathbb E[Y^2]$ is not finite and the squared risk of constant predictors is not a finite criterion, so this $L^2$ decomposition is unavailable. In contrast, a deterministic relation $Y=m(X)$ has $\operatorname{Var}(Y\mid X)=0$, and all remaining risk comes from choosing a predictor different from $m(X)$.
## Simple Linear Regression As A Population Problem
If the best unrestricted predictor is the conditional mean, why do we fit straight lines? The answer is that a straight line is the best predictor inside a restricted class. Simple linear regression begins by projecting $Y$ onto the affine span of $1$ and a single covariate $X$.
Assume now that $X$ and $Y$ are real-valued random variables with $\mathbb E[X^2]<\infty$ and $\mathbb E[Y^2]<\infty$. An affine predictor has the form $a+bX$, with $a,b\in\mathbb R$. The population simple linear regression coefficients are the values that minimize
\begin{align*}
Q(a,b)=\mathbb E[(Y-a-bX)^2].
\end{align*}
[definition: Population Simple Linear Regression Coefficients]
Let $X,Y \in L^2(\Omega, \mathcal F, \mathbb P)$. A pair $(\alpha,\beta)\in\mathbb R^2$ is a pair of population simple linear regression coefficients if
\begin{align*}
(\alpha,\beta) \in \operatorname*{argmin}_{a,b\in\mathbb R}\mathbb E[(Y-a-bX)^2].
\end{align*}
[/definition]
When $\operatorname{Var}(X)>0$, the minimizer is unique. If $\operatorname{Var}(X)=0$, the covariate carries no variation, so only the combined constant prediction $a+bX$ is identified.
To use the definition, we need more than existence of a best affine predictor; we need a computable description of it. The key obstruction is that the two normal equations can determine both intercept and slope only when the covariate has genuine variation. The following result records exactly how the covariance structure controls the fitted line and what degenerates when $X$ is almost surely constant.
[quotetheorem:4426]
[citeproof:4426]
The residual $Y-\alpha-\beta X$ is orthogonal in expectation to both $1$ and $X$. The condition $\operatorname{Var}(X)>0$ is what makes the slope identifiable: if $X=c$ a.s., then $a+bX=a+bc$ is only a constant predictor, and infinitely many pairs $(a,b)$ give the same fitted random variable. The normal equations also do not imply that the residual is independent of $X$, nor that the linear relation is causal. They are first-order optimality equations for the chosen affine prediction class, and the sample normal equations in the next chapter are finite-dimensional analogues obtained by replacing population inner products with empirical ones.
[example: Centering Reduces The Normal Equations]
Suppose $X$ is study hours and $Y$ is exam score, with $\mathbb E[X]=8$, $\mathbb E[Y]=70$, $\operatorname{Var}(X)=9$, and $\operatorname{Cov}(X,Y)=18$. Define the centered variables
\begin{align*}
X_c&=X-8, &
Y_c&=Y-70.
\end{align*}
Then
\begin{align*}
\mathbb E[X_c]&=\mathbb E[X]-8=8-8=0,\\
\mathbb E[Y_c]&=\mathbb E[Y]-70=70-70=0.
\end{align*}
The intercept normal equation gives
\begin{align*}
0&=\mathbb E[Y-\alpha-\beta X]\\
&=\mathbb E[Y]-\alpha-\beta\mathbb E[X]\\
&=70-\alpha-8\beta,
\end{align*}
so
\begin{align*}
\alpha=70-8\beta.
\end{align*}
With this value of $\alpha$,
\begin{align*}
Y-\alpha-\beta X
&=Y-(70-8\beta)-\beta X\\
&=Y-70+8\beta-\beta X\\
&=(Y-70)-\beta(X-8)\\
&=Y_c-\beta X_c.
\end{align*}
Thus the slope is determined by the centered orthogonality equation
\begin{align*}
0&=\mathbb E[X_c(Y_c-\beta X_c)]\\
&=\mathbb E[X_cY_c]-\beta\mathbb E[X_c^2].
\end{align*}
Because $\mathbb E[X_c]=\mathbb E[Y_c]=0$,
\begin{align*}
\mathbb E[X_cY_c]&=\operatorname{Cov}(X_c,Y_c)=\operatorname{Cov}(X,Y)=18,\\
\mathbb E[X_c^2]&=\operatorname{Var}(X_c)=\operatorname{Var}(X)=9.
\end{align*}
Substituting these values gives
\begin{align*}
0&=18-9\beta,\\
9\beta&=18,\\
\beta&=2.
\end{align*}
Finally,
\begin{align*}
\alpha&=70-8\beta\\
&=70-8\cdot 2\\
&=70-16\\
&=54.
\end{align*}
Centering removes the intercept from the slope calculation: the slope comes from covariance over variance, and the intercept is then restored by forcing the fitted line to match the mean of $Y$.
[/example]
The population regression line is not yet an estimator. It is the ideal target determined by the joint distribution of $(X,Y)$. To compute a line from data, the population criterion must be replaced by an observed least-squares criterion, which leads to the sample coefficients.
[definition: Sample Simple Linear Regression Coefficients]
Given observations $(x_1,y_1),\dots,(x_n,y_n)\in\mathbb R^2$, sample simple linear regression coefficients are any minimizers $(\hat\alpha,\hat\beta)$ of
\begin{align*}
S_n(a,b)=\sum_{i=1}^n (y_i-a-bx_i)^2.
\end{align*}
[/definition]
The sample coefficients are random when the observations come from a random sample. Their behaviour as estimators of $(\alpha,\beta)$ is a later topic; for now the distinction is that population regression minimizes expected loss, while sample regression minimizes observed loss.
[example: Population Versus Sample Regression]
Consider one concrete sample of ten students with study hours $x_i=4,5,\dots,13$ and scores
\begin{align*}
y_i=50+2.4x_i.
\end{align*}
Then
\begin{align*}
\bar x
&=\frac{4+5+\cdots+13}{10}
=\frac{85}{10}
=8.5,
\end{align*}
and
\begin{align*}
\bar y
&=\frac{1}{10}\sum_{i=1}^{10}(50+2.4x_i)\\
&=50+2.4\bar x\\
&=50+2.4\cdot 8.5\\
&=50+20.4\\
&=70.4.
\end{align*}
The centered $x$-values are
\begin{align*}
-4.5,-3.5,-2.5,-1.5,-0.5,0.5,1.5,2.5,3.5,4.5,
\end{align*}
so
\begin{align*}
\sum_{i=1}^{10}(x_i-\bar x)^2
&=2(4.5^2+3.5^2+2.5^2+1.5^2+0.5^2)\\
&=2(20.25+12.25+6.25+2.25+0.25)\\
&=2\cdot 41.25\\
&=82.5.
\end{align*}
Since $y_i=50+2.4x_i$ and $\bar y=50+2.4\bar x$,
\begin{align*}
y_i-\bar y
&=(50+2.4x_i)-(50+2.4\bar x)\\
&=2.4(x_i-\bar x).
\end{align*}
Therefore the sample slope is
\begin{align*}
\hat\beta
&=\frac{\sum_{i=1}^{10}(x_i-\bar x)(y_i-\bar y)}
{\sum_{i=1}^{10}(x_i-\bar x)^2}\\
&=\frac{\sum_{i=1}^{10}(x_i-\bar x)\,2.4(x_i-\bar x)}
{\sum_{i=1}^{10}(x_i-\bar x)^2}\\
&=\frac{2.4\sum_{i=1}^{10}(x_i-\bar x)^2}
{\sum_{i=1}^{10}(x_i-\bar x)^2}\\
&=2.4,
\end{align*}
and the sample intercept is
\begin{align*}
\hat\alpha
&=\bar y-\hat\beta\bar x\\
&=70.4-2.4\cdot 8.5\\
&=70.4-20.4\\
&=50.
\end{align*}
Thus this sample produces the fitted line $\hat\alpha+\hat\beta x=50+2.4x$, while the population line remains $54+2X$. The two lines summarize different objects: the sample line minimizes squared error for these ten observed points, while the population line minimizes expected squared prediction error under the full data-generating distribution.
[/example]
## Regression To The Mean
Why does an unusually large value of one variable tend to be followed by a less unusual prediction for another related variable? The population regression formula gives a precise answer when two standardized variables have correlation with magnitude less than $1$.
Let $X,Y\in L^2(\Omega,\mathcal F,\mathbb P)$ have means $\mu_X,\mu_Y$, positive variances $\sigma_X^2,\sigma_Y^2$, and correlation
\begin{align*}
\rho = \frac{\operatorname{Cov}(X,Y)}{\sigma_X\sigma_Y}.
\end{align*}
Then the population simple linear regression slope is
\begin{align*}
\beta=\frac{\rho\sigma_Y}{\sigma_X},
\end{align*}
and the fitted value can be written as
\begin{align*}
\frac{\alpha+\beta X-\mu_Y}{\sigma_Y}
=\rho\frac{X-\mu_X}{\sigma_X}.
\end{align*}
[example: Galton Style Regression To The Mean]
Suppose $X$ is a standardized parental height measurement and $Y$ is a standardized adult child height measurement. Thus
\begin{align*}
\mathbb E[X]&=0, &
\mathbb E[Y]&=0, &
\operatorname{Var}(X)&=1, &
\operatorname{Var}(Y)&=1,
\end{align*}
and the correlation is $\rho=0.6$. For standardized variables, the population affine regression slope is
\begin{align*}
\beta
&=\frac{\operatorname{Cov}(X,Y)}{\operatorname{Var}(X)}.
\end{align*}
Since $\rho=\operatorname{Cov}(X,Y)/(\sigma_X\sigma_Y)$ and $\sigma_X=\sigma_Y=1$,
\begin{align*}
\operatorname{Cov}(X,Y)
&=\rho\sigma_X\sigma_Y\\
&=0.6\cdot 1\cdot 1\\
&=0.6.
\end{align*}
Therefore
\begin{align*}
\beta
&=\frac{0.6}{1}\\
&=0.6.
\end{align*}
The intercept is
\begin{align*}
\alpha
&=\mathbb E[Y]-\beta\mathbb E[X]\\
&=0-0.6\cdot 0\\
&=0,
\end{align*}
so the best affine prediction on the standardized scale is
\begin{align*}
\alpha+\beta X=0.6X.
\end{align*}
For parents two standard deviations above the mean, $X=2$, so the fitted child-height value is
\begin{align*}
0.6X
&=0.6\cdot 2\\
&=1.2.
\end{align*}
The predicted child height is still above average, but it is only $1.2$ standard deviations above the mean rather than $2$; the prediction moves toward the mean because $0<0.6<1$.
[/example]
Regression to the mean is not a claim that tall parents cause children to shrink toward average height. It is a statement about prediction under imperfect correlation: extreme observed covariate values usually combine signal with idiosyncratic variation, and the best squared-loss prediction discounts the idiosyncratic part.
[remark: Regression To The Mean Is Not Temporal Reversal]
The phrase can be misleading because it sounds like a deterministic force. In the population regression calculation, it is a geometric fact about the slope of an $L^2$ projection. Whenever $|\rho|<1$, standardized affine predictions are less extreme than the standardized covariate values used to make them.
[/remark]
## Orthogonal Projection In $L^2$
What geometry is hidden inside least squares? The space $L^2(\Omega,\mathcal F,\mathbb P)$ is a Hilbert space, and minimizing mean squared error is the same as finding the closest point in a closed linear subspace.
[illustration:l2-orthogonal-projection]
[definition: $L^2$ Inner Product]
The $L^2$ inner product is the map $(\cdot,\cdot)_{L^2}:L^2(\Omega,\mathcal F,\mathbb P)\times L^2(\Omega,\mathcal F,\mathbb P)\to\mathbb R$ defined for real-valued $U,V\in L^2(\Omega,\mathcal F,\mathbb P)$ by
\begin{align*}
(U,V)_{L^2}=\mathbb E[UV].
\end{align*}
The associated norm is the map $\|\cdot\|_{L^2}:L^2(\Omega,\mathcal F,\mathbb P)\to\mathbb R$ defined by
\begin{align*}
\|U\|_{L^2}=\mathbb E[U^2]^{1/2}.
\end{align*}
[/definition]
With this notation, the risk of a predictor $Z$ is $\|Y-Z\|_{L^2}^2$. Least squares over a class of predictors is nearest-point projection in this norm. To state a general projection principle, we must specify which collections of predictors are large enough to contain their nearest points; this is why closed linear prediction spaces are singled out.
[definition: Closed Linear Prediction Space]
A closed linear prediction space is a closed linear subspace $\mathcal H\subset L^2(\Omega,\mathcal F,\mathbb P)$ whose elements are admissible predictors of $Y$.
[/definition]
For unrestricted prediction from $X$, the natural space is $L^2(\sigma(X))$, the closed subspace of square-integrable random variables that are measurable with respect to $\sigma(X)$. For affine prediction from a scalar $X$, the relevant finite-dimensional space is $\operatorname{span}\{1,X\}$. The point of introducing these spaces is that least squares can now be expressed as a Hilbert-space projection theorem, with orthogonality characterizing the best predictor.
[quotetheorem:4427]
[citeproof:4427]
The best predictor theorem is this projection principle with $\mathcal H=L^2(\sigma(X))$. The simple linear regression normal equations are the same principle with $\mathcal H=\operatorname{span}\{1,X\}$. Closedness of $\mathcal H$ is the key existence hypothesis. In a Hilbert space, a nonclosed linear subspace can have points whose distance to the subspace is an infimum that is not attained, so there need not be a best predictor inside that class. Finite-dimensional spans such as $\operatorname{span}\{1,X\}$ are closed, which is why linear regression avoids this pathology at the population level. The projection statement is also geometric rather than causal: it gives the closest approximation in squared norm, not conditional independence, structural explanation, or invariance under intervention.
[example: Orthogonality Of The Simple Regression Residual]
Let $\hat Y=\alpha+\beta X$ be the population affine projection of $Y$ onto $\operatorname{span}\{1,X\}$, and define the residual
\begin{align*}
\varepsilon=Y-\hat Y=Y-\alpha-\beta X.
\end{align*}
Because $1\in\operatorname{span}\{1,X\}$ and $X\in\operatorname{span}\{1,X\}$, the projection residual is orthogonal to both of these predictors:
\begin{align*}
(\varepsilon,1)_{L^2}&=0, &
(\varepsilon,X)_{L^2}&=0.
\end{align*}
Using the definition of the $L^2$ inner product, these become
\begin{align*}
0
&=(\varepsilon,1)_{L^2}\\
&=\mathbb E[\varepsilon\cdot 1]\\
&=\mathbb E[\varepsilon],
\end{align*}
and
\begin{align*}
0
&=(\varepsilon,X)_{L^2}\\
&=\mathbb E[\varepsilon X]\\
&=\mathbb E[X\varepsilon].
\end{align*}
Substituting $\varepsilon=Y-\alpha-\beta X$ gives
\begin{align*}
\mathbb E[Y-\alpha-\beta X]&=0,\\
\mathbb E[X(Y-\alpha-\beta X)]&=0.
\end{align*}
Thus the two population normal equations are exactly the two orthogonality conditions saying that the residual is perpendicular to the affine prediction space generated by $1$ and $X$.
[/example]
This geometry is the main reason least squares is tractable. Once predictors form a linear space, optimality becomes orthogonality, and orthogonality becomes equations.
## Linearity Restrictions And Misspecification
If least squares targets conditional expectation, why can a linear regression line be wrong? The phrase needs a qualification: unrestricted least squares targets $\mathbb E[Y\mid X]$, while linear least squares targets the projection of $\mathbb E[Y\mid X]$ onto the chosen linear class.
[definition: Linear Specification]
A simple linear regression model is correctly specified in conditional mean if there exist $\alpha,\beta\in\mathbb R$ such that
\begin{align*}
\mathbb E[Y\mid X]=\alpha+\beta X
\end{align*}
a.s.
[/definition]
When this condition holds, the affine least-squares target equals the conditional mean. When it fails, the fitted population line is still well-defined, but it is an approximation rather than the full conditional mean. We therefore need a theorem that identifies exactly what linear least squares targets under misspecification: the orthogonal projection of the conditional mean onto the affine prediction class.
[quotetheorem:4428]
[citeproof:4428]
This result is the conceptual boundary of linear regression. The method is optimal for the target class it is given, but the target class may not contain the true conditional mean. Square integrability is again what makes both $Y$ and $m(X)$ elements of $L^2$, so that the orthogonal decomposition is meaningful. The affine span is finite-dimensional and therefore closed, so the projection exists even when the conditional mean is nonlinear. None of this identifies causal structure: a linear conditional mean can arise from confounding, and a nonlinear conditional mean can still come from a simple causal mechanism after a nonlinear transformation. The next example is the basic misspecification counterexample: the fitted line satisfies all linear normal equations while missing the shape of $\mathbb E[Y\mid X]$.
[example: Nonlinear Conditional Mean And Misspecification]
Let $X\sim\operatorname{Unif}(-1,1)$ and let $Y=X^2+\eta$, where $\mathbb E[\eta\mid X]=0$ and $\eta\in L^2(\Omega,\mathcal F,\mathbb P)$. Since $X$ is bounded and $\eta\in L^2$, we have $Y\in L^2$. The conditional mean is
\begin{align*}
\mathbb E[Y\mid X]
&=\mathbb E[X^2+\eta\mid X]\\
&=\mathbb E[X^2\mid X]+\mathbb E[\eta\mid X]\\
&=X^2+0\\
&=X^2.
\end{align*}
This is not affine in $X$: if $X^2=a+bX$ a.s. on $[-1,1]$, then the polynomial $x^2-a-bx$ would vanish on an interval, hence would be the zero polynomial, impossible because its $x^2$ coefficient is $1$.
We now compute the best affine predictor $a+bX$ by minimizing
\begin{align*}
Q(a,b)&=\mathbb E[(Y-a-bX)^2]\\
&=\mathbb E[(X^2+\eta-a-bX)^2].
\end{align*}
Expanding around the part depending only on $X$ gives
\begin{align*}
Q(a,b)
&=\mathbb E[(X^2-a-bX)^2]
+2\mathbb E[(X^2-a-bX)\eta]
+\mathbb E[\eta^2].
\end{align*}
The mixed term is zero because $X^2-a-bX$ is $\sigma(X)$-measurable:
\begin{align*}
\mathbb E[(X^2-a-bX)\eta]
&=\mathbb E\!\left[\mathbb E[(X^2-a-bX)\eta\mid X]\right]\\
&=\mathbb E\!\left[(X^2-a-bX)\mathbb E[\eta\mid X]\right]\\
&=\mathbb E[(X^2-a-bX)\cdot 0]\\
&=0.
\end{align*}
Thus the minimizing values of $a,b$ are the ones that minimize $\mathbb E[(X^2-a-bX)^2]$. For $X\sim\operatorname{Unif}(-1,1)$,
\begin{align*}
\mathbb E[X]
&=\frac12\int_{-1}^1 x\,dx
=\frac12\left[\frac{x^2}{2}\right]_{-1}^1
=0,\\
\mathbb E[X^2]
&=\frac12\int_{-1}^1 x^2\,dx
=\frac12\left[\frac{x^3}{3}\right]_{-1}^1
=\frac13,\\
\mathbb E[X^3]
&=\frac12\int_{-1}^1 x^3\,dx
=\frac12\left[\frac{x^4}{4}\right]_{-1}^1
=0,\\
\mathbb E[X^4]
&=\frac12\int_{-1}^1 x^4\,dx
=\frac12\left[\frac{x^5}{5}\right]_{-1}^1
=\frac15.
\end{align*}
Therefore
\begin{align*}
\mathbb E[(X^2-a-bX)^2]
&=\mathbb E[X^4-2aX^2-2bX^3+a^2+2abX+b^2X^2]\\
&=\frac15-2a\cdot\frac13-2b\cdot 0+a^2+2ab\cdot 0+b^2\cdot\frac13\\
&=a^2-\frac{2a}{3}+\frac{b^2}{3}+\frac15\\
&=\left(a-\frac13\right)^2+\frac{b^2}{3}+\frac{4}{45}.
\end{align*}
This is minimized uniquely at $a=1/3$ and $b=0$. Hence the best affine predictor is the constant $1/3$, even though the conditional mean is the curved function $X^2$.
[/example]
This example shows why a residual plot against $X$ can reveal structure missed by a linear fit. The residual still satisfies the population orthogonality equations, but it retains systematic nonlinear dependence on $X$.
[remark: Orthogonality Is Weaker Than Independence]
In a population linear regression, the residual is orthogonal to the regressors used in the fit. This does not imply that the residual is independent of the regressors. In the nonlinear example above, $Y-1/3$ is orthogonal to $1$ and $X$, but its conditional mean given $X$ is $X^2-1/3$, so dependence remains.
[/remark]
The chapter has separated three objects that are often conflated: the conditional mean $\mathbb E[Y\mid X]$, the population least-squares projection within a chosen class, and the sample least-squares fit computed from observed data. The next chapter turns the affine projection idea into matrix notation, where multiple covariates, intercepts, rank conditions, fitted values, and residual geometry can be handled at once.
Chapter 1 showed that regression is a projection problem, first in the population and then in finite samples. To work with many covariates at once, we now rewrite that projection in matrix form, where rank, fitted values, and residuals can be handled systematically.
# 2. Ordinary Least Squares in Matrix Form
This chapter turns Chapter 1's $L^2$ projection picture into a finite-dimensional calculation in Euclidean space. Instead of fitting a line with one covariate, we allow many columns of explanatory information and ask when the least squares problem has a well-defined coefficient vector. The central theme is that ordinary least squares is an orthogonal projection of the observed response vector onto the column space of the design matrix.
## Encoding Linear Models with a Design Matrix
How do we put several numerical covariates, categorical choices, and interactions into one regression problem without changing the algebra each time? The answer is to separate the statistical meaning of the variables from the linear algebra object that least squares sees: the design matrix. Once the columns of this matrix are fixed, least squares treats every model as a problem of approximating a vector in $\mathbb R^n$ by a vector in a subspace.
[definition: Multiple Linear Regression Model]
Let $n,p \in \mathbb N$. A multiple linear regression model consists of a response vector $y \in \mathbb R^n$, a design matrix $X \in \mathbb R^{n \times p}$, an unknown coefficient vector $\beta \in \mathbb R^p$, and an error vector $\varepsilon \in \mathbb R^n$ satisfying
\begin{align*}
y = X\beta + \varepsilon.
\end{align*}
[/definition]
The $i$th row of $X$ records the covariate values for observation $i$, while the $j$th column records one feature across all observations. The coefficient $\beta_j$ is the amount by which the fitted mean changes when the $j$th column changes by one unit and all other columns are held fixed, provided that interpretation is meaningful for the chosen encoding. To make this row-and-column convention precise for arbitrary numbers of observations and features, we now define the design matrix itself.
[definition: Design Matrix]
For observations indexed by $i=1,\dots,n$ and features indexed by $j=1,\dots,p$, the design matrix is the matrix
\begin{align*}
X = \begin{pmatrix}
x_{11} & x_{12} & \cdots & x_{1p} \\
x_{21} & x_{22} & \cdots & x_{2p} \\
\vdots & \vdots & \ddots & \vdots \\
x_{n1} & x_{n2} & \cdots & x_{np}
\end{pmatrix} \in \mathbb R^{n \times p}.
\end{align*}
[/definition]
A column of ones is used when the model includes an intercept. With covariates $z_{i1},\dots,z_{iq}$, the usual intercept model has row
\begin{align*}
X_i = (1,z_{i1},\dots,z_{iq}),
\end{align*}
so $p=q+1$ and the fitted value at observation $i$ has the form $\hat y_i=\hat\beta_0+\sum_{j=1}^q z_{ij}\hat\beta_j$.
[example: Salary Regression Design]
Suppose $y_i$ is the salary of employee $i$, $e_i$ is years of education, and $r_i$ is years of experience. With an intercept, the design matrix has $i$th row
\begin{align*}
X_i=(1,e_i,r_i),
\end{align*}
so for $\beta=(\beta_0,\beta_1,\beta_2)^\top$ the fitted mean at observation $i$ is the row-vector product
\begin{align*}
(X\beta)_i
&= X_i\beta \\
&= (1,e_i,r_i)
\begin{pmatrix}
\beta_0\\
\beta_1\\
\beta_2
\end{pmatrix} \\
&= 1\cdot \beta_0+e_i\beta_1+r_i\beta_2 \\
&= \beta_0+\beta_1e_i+\beta_2r_i.
\end{align*}
Thus the mean structure is
\begin{align*}
\mathbb E[Y_i\mid e_i,r_i]=\beta_0+\beta_1e_i+\beta_2r_i.
\end{align*}
If two employees have the same experience $r$ but education levels $e$ and $e+1$, their fitted means differ by
\begin{align*}
\{\beta_0+\beta_1(e+1)+\beta_2r\}-\{\beta_0+\beta_1e+\beta_2r\}
&= \beta_0+\beta_1e+\beta_1+\beta_2r-\beta_0-\beta_1e-\beta_2r \\
&= \beta_1.
\end{align*}
Similarly, if two employees have the same education $e$ but experience levels $r$ and $r+1$, their fitted means differ by
\begin{align*}
\{\beta_0+\beta_1e+\beta_2(r+1)\}-\{\beta_0+\beta_1e+\beta_2r\}
&= \beta_0+\beta_1e+\beta_2r+\beta_2-\beta_0-\beta_1e-\beta_2r \\
&= \beta_2.
\end{align*}
So $\beta_1$ is the one-year education comparison with experience held fixed, while $\beta_2$ is the one-year experience comparison with education held fixed; the least squares calculation then uses the three fixed columns $1$, education, and experience exactly as entries of the design matrix.
[/example]
Categorical covariates must be converted into numerical columns before they can enter a design matrix. The basic numerical building block is an indicator for membership in a category; once these indicators are defined, we can explain why an intercept model usually omits one category to avoid exact linear dependence.
The obstruction is that category labels such as control, low dose, and high dose have no inherent arithmetic meaning: subtracting or averaging the labels would depend on an arbitrary coding choice. A dummy variable isolates one category as a yes-or-no numerical feature, so the design matrix records membership without pretending that the labels lie on a quantitative scale.
[definition: Dummy Variable]
Let $C_i$ be a categorical covariate taking values in $\{1,\dots,k\}$. A dummy variable for category $a$ is the numerical feature
\begin{align*}
d_{ia}=\mathbb{1}_{\{C_i=a\}}.
\end{align*}
[/definition]
With an intercept present, the standard encoding keeps $k-1$ dummy variables and treats the omitted category as the reference level. The intercept then represents the fitted mean for the reference category when the remaining covariates are zero, and each dummy coefficient represents a difference from that reference category.
[example: One-Hot Treatment Encoding]
Suppose a medical treatment has three levels: control, low dose, and high dose. With an intercept and control as the reference level, use the design row
\begin{align*}
X_i=(1,d_{i,\mathrm{low}},d_{i,\mathrm{high}}),
\end{align*}
where $d_{i,\mathrm{low}}=1$ exactly for low-dose observations and $d_{i,\mathrm{high}}=1$ exactly for high-dose observations. For $\hat\beta=(\hat\beta_0,\hat\beta_1,\hat\beta_2)^\top$, the fitted mean at observation $i$ is
\begin{align*}
X_i\hat\beta
&=(1,d_{i,\mathrm{low}},d_{i,\mathrm{high}})
\begin{pmatrix}
\hat\beta_0\\
\hat\beta_1\\
\hat\beta_2
\end{pmatrix} \\
&=1\cdot \hat\beta_0+d_{i,\mathrm{low}}\hat\beta_1+d_{i,\mathrm{high}}\hat\beta_2.
\end{align*}
For a control observation, $d_{i,\mathrm{low}}=0$ and $d_{i,\mathrm{high}}=0$, so
\begin{align*}
X_i\hat\beta
&=\hat\beta_0+0\cdot \hat\beta_1+0\cdot \hat\beta_2 \\
&=\hat\beta_0.
\end{align*}
For a low-dose observation, $d_{i,\mathrm{low}}=1$ and $d_{i,\mathrm{high}}=0$, so
\begin{align*}
X_i\hat\beta
&=\hat\beta_0+1\cdot \hat\beta_1+0\cdot \hat\beta_2 \\
&=\hat\beta_0+\hat\beta_1.
\end{align*}
For a high-dose observation, $d_{i,\mathrm{low}}=0$ and $d_{i,\mathrm{high}}=1$, so
\begin{align*}
X_i\hat\beta
&=\hat\beta_0+0\cdot \hat\beta_1+1\cdot \hat\beta_2 \\
&=\hat\beta_0+\hat\beta_2.
\end{align*}
If the control indicator $d_{\mathrm{control}}$ were also included, then for every observation exactly one of $d_{i,\mathrm{control}}$, $d_{i,\mathrm{low}}$, and $d_{i,\mathrm{high}}$ equals $1$, while the other two equal $0$. Hence for each $i$,
\begin{align*}
d_{i,\mathrm{control}}+d_{i,\mathrm{low}}+d_{i,\mathrm{high}}
&=1.
\end{align*}
As a vector identity, this is
\begin{align*}
d_{\mathrm{control}}+d_{\mathrm{low}}+d_{\mathrm{high}}=\mathbf{1}.
\end{align*}
Equivalently,
\begin{align*}
\mathbf{1}-d_{\mathrm{control}}-d_{\mathrm{low}}-d_{\mathrm{high}}=0,
\end{align*}
with coefficients $1,-1,-1,-1$ not all zero, so the four columns are linearly dependent. Omitting the control indicator removes this exact dependence and makes the intercept represent the fitted control mean, while $\hat\beta_1$ and $\hat\beta_2$ represent fitted differences from control.
[/example]
Interactions and transformations expand the column space while preserving linearity in the coefficient vector. A model may be nonlinear as a function of the raw covariate and still be linear regression if it is linear in the unknown coefficients. To handle the common case where the effect of one feature is allowed to vary with another, we add product columns to the design matrix.
[definition: Interaction Column]
Given two numerical feature columns $u,v \in \mathbb R^n$, their interaction column is the feature $w \in \mathbb R^n$ defined by
\begin{align*}
w_i = u_i v_i, \qquad i=1,\dots,n.
\end{align*}
[/definition]
An interaction column allows the slope attached to one variable to depend on the level of another. This is not a new estimation method; it is a new choice of column in $X$.
[example: Polynomial Regression]
Let $t_i \in \mathbb R$ be a scalar covariate and consider the quadratic mean model
\begin{align*}
m(t_i)=\beta_0+\beta_1t_i+\beta_2t_i^2.
\end{align*}
This is ordinary linear regression after choosing the design row
\begin{align*}
X_i=(1,t_i,t_i^2),
\end{align*}
because for $\beta=(\beta_0,\beta_1,\beta_2)^\top$,
\begin{align*}
X_i\beta
&=(1,t_i,t_i^2)
\begin{pmatrix}
\beta_0\\
\beta_1\\
\beta_2
\end{pmatrix} \\
&=1\cdot \beta_0+t_i\beta_1+t_i^2\beta_2 \\
&=\beta_0+\beta_1t_i+\beta_2t_i^2.
\end{align*}
For observed responses $y_1,\dots,y_n$, the residual sum of squares is therefore
\begin{align*}
RSS(\beta_0,\beta_1,\beta_2)
&=\sum_{i=1}^n \left(y_i-X_i\beta\right)^2 \\
&=\sum_{i=1}^n \left(y_i-\beta_0-\beta_1t_i-\beta_2t_i^2\right)^2.
\end{align*}
The curve is nonlinear as a function of $t_i$ because it contains $t_i^2$, but the expression is still linear in the unknown coefficients $\beta_0,\beta_1,\beta_2$ before squaring. Once the observed values $t_1,\dots,t_n$ are fixed, the three columns
\begin{align*}
\begin{pmatrix}1\\ \vdots\\ 1\end{pmatrix},
\qquad
\begin{pmatrix}t_1\\ \vdots\\ t_n\end{pmatrix},
\qquad
\begin{pmatrix}t_1^2\\ \vdots\\ t_n^2\end{pmatrix}
\end{align*}
are fixed numerical vectors, so the least squares problem is the ordinary problem of approximating $y$ by a vector in their span.
[/example]
The ability to add columns creates a basic identifiability question. If two different coefficient vectors produce the same fitted vector $X\beta$, then the fitted values may be determined but the coefficients are not.
[definition: Full Column Rank]
A matrix $X \in \mathbb R^{n \times p}$ has full column rank if
\begin{align*}
\operatorname{rank}(X)=p.
\end{align*}
[/definition]
Full column rank requires $p \le n$ and excludes exact collinearity among the columns. It is a condition on the realised design matrix, not a probabilistic assumption about errors.
[quotetheorem:4429]
[citeproof:4429]
This theorem explains why rank checks are part of model specification: coefficient interpretation depends on the map from coefficients to fitted values being one-to-one. If an intercept and all category indicators are included together, increasing the intercept while decreasing every category coefficient by the same amount leaves $X\beta$ unchanged, so the separate coefficients cannot be identified. The theorem does not say that rank failure prevents prediction, since the fitted vector may still be determined by projection onto $\operatorname{Range}(X)$. The next section uses this distinction by first solving for fitted values through a minimization problem, then returning to full column rank when a unique coefficient formula is needed.
## Least Squares and the Normal Equations
Given $X$ and $y$, which fitted vector in $\operatorname{Range}(X)$ should represent the data? Ordinary least squares chooses the vector $X\hat\beta$ whose squared Euclidean distance from $y$ is as small as possible. The coefficient vector is therefore found by minimizing the residual sum of squares.
[definition: Residual Sum of Squares]
For $X \in \mathbb R^{n \times p}$ and $y \in \mathbb R^n$, the residual sum of squares is the function
\begin{align*}
RSS: \mathbb R^p &\to \mathbb R, \\
\beta &\mapsto |y-X\beta|^2 = \sum_{i=1}^n (y_i-(X\beta)_i)^2.
\end{align*}
[/definition]
The least squares estimator is a minimizer of this function. The use of Euclidean norm makes the problem both geometric and computational: geometrically, we seek the closest point in a subspace; computationally, we solve a quadratic minimization problem.
[definition: Ordinary Least Squares Estimator]
An ordinary least squares estimator for $\beta$ is any vector $\hat\beta \in \mathbb R^p$ satisfying
\begin{align*}
RSS(\hat\beta) = \inf_{b \in \mathbb R^p} RSS(b).
\end{align*}
[/definition]
The objective function can be expanded as
\begin{align*}
RSS(\beta)
&= (y-X\beta)^\top (y-X\beta) \\
&= y^\top y - 2\beta^\top X^\top y + \beta^\top X^\top X\beta.
\end{align*}
This expansion shows that least squares is a quadratic optimization problem. The condition we need next is the coordinate-free algebraic test for a coefficient vector to minimize this quadratic: the residual must be orthogonal to every column direction in the design matrix.
[quotetheorem:501]
[citeproof:501]
The normal equations say that the residual vector is orthogonal to every column of $X$. This condition is stronger than merely making the residuals small in length: it rules out any remaining first-order improvement in a column direction. If a candidate fit has residuals with nonzero inner product against a covariate column, then moving the coefficient in that direction decreases $RSS$ for small enough movement, so the candidate cannot be least squares. The theorem does not require full column rank, so it characterizes all minimizers even when the coefficient vector is non-unique. This orthogonality viewpoint is the route to the fitted values, residuals, and hat matrix introduced in the next section.
[definition: Fitted Values and Residuals]
For an ordinary least squares estimator $\hat\beta$, the fitted value vector and residual vector are
\begin{align*}
\hat y = X\hat\beta, \qquad \hat\varepsilon = y - \hat y.
\end{align*}
[/definition]
With an intercept column, residual orthogonality to the intercept gives $\sum_{i=1}^n \hat\varepsilon_i=0$. Orthogonality to a covariate column says that the residuals have zero empirical correlation with that column before centering conventions are applied.
The orthogonality equations identify least-squares fits, but they do not by themselves say when the coefficient vector is unique or when the inverse formula is valid. The next formal result separates those existence and uniqueness issues.
[quotetheorem:4430]
[citeproof:4430]
When $X$ lacks full column rank, fitted values can still be unique even though coefficients need not be. For instance, two identical covariate columns allow the same fitted vector to be represented by many pairs of coefficients whose sum is fixed. The theorem therefore separates existence of a least squares fit from uniqueness of the coefficient vector. It also shows exactly where the familiar inverse formula can fail: $X^\top X$ is singular precisely when the columns of $X$ are linearly dependent. Later inference for individual coefficients requires care in that case, which is why the full column rank assumption is standard in the basic theory.
## Projection, Hat Matrix, and Leverage
What is the geometric operation performed by least squares? It takes the observed vector $y$ and projects it orthogonally onto the subspace spanned by the columns of $X$. The matrix carrying out this projection is called the hat matrix because it maps $y$ to $\hat y$.
[definition: Hat Matrix]
If $X \in \mathbb R^{n \times p}$ has full column rank, the hat matrix is the linear map $H: \mathbb R^n \to \mathbb R^n$ represented by
\begin{align*}
H = X(X^\top X)^{-1}X^\top \in \mathbb R^{n \times n}.
\end{align*}
[/definition]
Then $\hat y=Hy$ and $\hat\varepsilon=(I-H)y$. The matrix $H$ depends only on the design matrix, not on the observed response values.
Once the fitted vector is written as $Hy$, the next issue is whether this matrix really has the geometric properties expected of an orthogonal projection. In particular, the formula should preserve vectors already in the model space and remove only the component perpendicular to that space. Verifying symmetry and idempotence turns the computational formula for $H$ into a geometric object that can support leverage and residual diagnostics.
[quotetheorem:4431]
[citeproof:4431]
This theorem is the bridge between linear algebra and statistical interpretation. The fitted vector is not produced by independently adjusting each observation; it is produced by projecting the whole response vector onto the model subspace. The full column rank assumption is used here to write the projection through $(X^\top X)^{-1}$; without it, the projection onto $\operatorname{Range}(X)$ still exists, but this particular inverse formula is unavailable. A useful counterexample is a design with two identical columns: the model subspace is unchanged if one copy is removed, while the coefficient representation becomes non-unique. The next question is how this projection treats individual observations, which leads immediately to the diagonal entries of $H$, called leverage, and later to the influence diagnostics of Chapter 8.
[definition: Leverage]
Assume $X \in \mathbb R^{n \times p}$ has full column rank and hat matrix $H$. The leverage of observation $i$ is the diagonal entry
\begin{align*}
h_{ii}=H_{ii}.
\end{align*}
[/definition]
Leverage measures how strongly the fitted value $\hat y_i$ can respond to the observed value $y_i$ through the projection matrix. The diagonal entries satisfy $0\le h_{ii}\le 1$ and $\sum_{i=1}^n h_{ii}=p$, so average leverage is $p/n$.
Before leverage can be used as a diagnostic, we need to know that these numbers are constrained by the geometry of projection rather than by a convention of terminology. If diagonal entries could be negative, exceed one, or have no fixed total, then comparing $h_{ii}$ with $p/n$ would have no stable meaning. The following result supplies the bounds and trace identity that make leverage a calibrated measure of how exceptional a design row is.
[quotetheorem:4432]
[citeproof:4432]
A high-leverage point has an unusual row of the design matrix relative to the other rows. It need not have a large residual, but it has the potential to exert substantial influence on the fitted regression surface. The trace identity also shows that leverage is a finite resource: if some observations have leverage far above $p/n$, others must have leverage below average. The bounds $0\le h_{ii}\le 1$ rely on $H$ being an orthogonal projection; an arbitrary smoothing matrix need not have diagonal entries with this interpretation. These facts motivate looking at leverage separately from residual size when diagnosing a fitted regression.
[example: Leverage in a Simple Intercept Model]
Consider the intercept-only model $X=\mathbf{1}\in\mathbb R^{n\times 1}$, where $\mathbf{1}=(1,\dots,1)^\top$. Then
\begin{align*}
X^\top X
&=\mathbf{1}^\top\mathbf{1}
=\sum_{i=1}^n 1
=n,
\end{align*}
so the hat matrix is
\begin{align*}
H
&=X(X^\top X)^{-1}X^\top \\
&=\mathbf{1}\,n^{-1}\mathbf{1}^\top \\
&=\frac{1}{n}\mathbf{1}\mathbf{1}^\top.
\end{align*}
Thus the fitted vector is
\begin{align*}
\hat y
&=Hy \\
&=\frac{1}{n}\mathbf{1}\mathbf{1}^\top y \\
&=\frac{1}{n}\mathbf{1}\sum_{j=1}^n y_j \\
&=\bar y\,\mathbf{1},
\end{align*}
so every fitted value equals the sample mean $\bar y$. Since every diagonal entry of $\mathbf{1}\mathbf{1}^\top$ is $1$, each leverage value is
\begin{align*}
h_{ii}
&=H_{ii}
=\frac{1}{n}.
\end{align*}
For comparison, in simple regression with intercept and one covariate $x_i$, the row is $X_i=(1,x_i)$. Write $\bar x=n^{-1}\sum_{j=1}^n x_j$ and $S_{xx}=\sum_{j=1}^n(x_j-\bar x)^2>0$. Then
\begin{align*}
X^\top X
&=
\begin{pmatrix}
n & \sum_{j=1}^n x_j\\
\sum_{j=1}^n x_j & \sum_{j=1}^n x_j^2
\end{pmatrix},
\end{align*}
and its determinant is
\begin{align*}
n\sum_{j=1}^n x_j^2-\left(\sum_{j=1}^n x_j\right)^2
&=n\sum_{j=1}^n x_j^2-n^2\bar x^2 \\
&=n\left(\sum_{j=1}^n x_j^2-n\bar x^2\right) \\
&=nS_{xx}.
\end{align*}
Hence
\begin{align*}
(X^\top X)^{-1}
&=\frac{1}{nS_{xx}}
\begin{pmatrix}
\sum_{j=1}^n x_j^2 & -\sum_{j=1}^n x_j\\
-\sum_{j=1}^n x_j & n
\end{pmatrix}.
\end{align*}
The leverage of observation $i$ is therefore
\begin{align*}
h_{ii}
&=X_i(X^\top X)^{-1}X_i^\top \\
&=(1,x_i)
\frac{1}{nS_{xx}}
\begin{pmatrix}
\sum_{j=1}^n x_j^2 & -\sum_{j=1}^n x_j\\
-\sum_{j=1}^n x_j & n
\end{pmatrix}
\begin{pmatrix}
1\\
x_i
\end{pmatrix} \\
&=\frac{\sum_{j=1}^n x_j^2-2x_i\sum_{j=1}^n x_j+nx_i^2}{nS_{xx}} \\
&=\frac{\sum_{j=1}^n x_j^2-2nx_i\bar x+nx_i^2}{nS_{xx}} \\
&=\frac{S_{xx}+n\bar x^2-2nx_i\bar x+nx_i^2}{nS_{xx}} \\
&=\frac{S_{xx}+n(x_i-\bar x)^2}{nS_{xx}} \\
&=\frac{1}{n}+\frac{(x_i-\bar x)^2}{S_{xx}}.
\end{align*}
Thus an observation whose covariate value lies far from $\bar x$ has larger leverage because the second term increases with $(x_i-\bar x)^2$. Leverage is determined entirely by the design matrix $X$, whereas residual size also depends on the response vector $y$.
[/example]
## Orthogonality and Sums of Squares
How does least squares split variation in the data into fitted and residual parts? Orthogonal projection gives a Pythagorean decomposition. With an intercept, this becomes the ANOVA identity that total variation around the sample mean equals explained variation plus residual variation.
[definition: Total, Explained, and Residual Sums of Squares]
Assume the model contains an intercept column and let
\begin{align*}
\bar y = \frac{1}{n}\sum_{i=1}^n y_i.
\end{align*}
The total sum of squares, explained sum of squares, and residual sum of squares at the least squares fit are
\begin{align*}
TSS &= \sum_{i=1}^n (y_i-\bar y)^2, \\
ESS &= \sum_{i=1}^n (\hat y_i-\bar y)^2, \\
RSS &= \sum_{i=1}^n (y_i-\hat y_i)^2.
\end{align*}
[/definition]
The intercept is essential for this exact centered decomposition because it ensures that the constant vector $\bar y\mathbf{1}$ lies in the model space and that residuals sum to zero.
Before decomposing variation around the sample mean, we need the basic orthogonality relation between fitted values and residuals. Least squares is not merely making residuals small coordinate by coordinate; it removes every component of the response that lies in directions available to the design. This is the geometric fact that makes later sums-of-squares identities possible.
[quotetheorem:4433]
[citeproof:4433]
Residual orthogonality is often the simplest way to check algebra in least squares calculations. If residuals still have a component in the direction of a design column, then the fitted vector has not fully used a direction that the model allows. The result does not say that residuals are orthogonal to the observed response $y$ itself, nor to variables that were not included as columns of $X$. It also explains why adding a column to the design can only decrease the residual sum of squares: the projection subspace becomes larger. This orthogonal splitting is the mechanism behind the sums-of-squares identity for intercept models.
The quantities $TSS$, $ESS$, and $RSS$ are useful only if they fit together as parts of one centered geometric decomposition. The possible pitfall is that centering at $\bar y\mathbf{1}$ introduces a constant direction, so the identity depends on the intercept being inside the model space. The next theorem isolates the precise condition under which explained and residual variation add back to total variation.
[quotetheorem:4434]
[citeproof:4434]
This identity justifies the language of explained and residual variation in the intercept model. The intercept hypothesis is not cosmetic: without the constant column, $\bar y\mathbf{1}$ need not lie in $\operatorname{Range}(X)$, so the two vectors in the displayed decomposition need not be orthogonal. For example, a regression forced through the origin generally uses variation around zero rather than variation around the sample mean, and the centered formula can fail. The theorem does not say that a larger $ESS$ proves a better causal model; it only records a geometric decomposition for the chosen design matrix. This prepares the later use of $R^2$, adjusted fit measures, and model comparison, where the same identity must be interpreted with care.
[example: Salary Regression Sums of Squares]
Let $y_i$ be salary, $e_i$ education, and $r_i$ experience, with design row $X_i=(1,e_i,r_i)$. The fitted vector $\hat y=X\hat\beta$ lies in the subspace
\begin{align*}
\operatorname{Range}(X)=\operatorname{span}\{\mathbf{1},e,r\},
\end{align*}
so the three sums of squares are
\begin{align*}
TSS&=\sum_{i=1}^n(y_i-\bar y)^2,\\
ESS&=\sum_{i=1}^n(\hat y_i-\bar y)^2,\\
RSS&=\sum_{i=1}^n(y_i-\hat y_i)^2.
\end{align*}
Because the intercept column is included, $\bar y\mathbf{1}\in\operatorname{Range}(X)$. Also $\hat y\in\operatorname{Range}(X)$, so $\hat y-\bar y\mathbf{1}\in\operatorname{Range}(X)$. The least squares residual $\hat\varepsilon=y-\hat y$ satisfies
\begin{align*}
X^\top\hat\varepsilon=0,
\end{align*}
so it is orthogonal to every vector in $\operatorname{Range}(X)$, including $\hat y-\bar y\mathbf{1}$. Hence
\begin{align*}
y-\bar y\mathbf{1}
&=(\hat y-\bar y\mathbf{1})+(y-\hat y),\\
|y-\bar y\mathbf{1}|^2
&=|\hat y-\bar y\mathbf{1}|^2+2(\hat y-\bar y\mathbf{1})^\top(y-\hat y)+|y-\hat y|^2\\
&=|\hat y-\bar y\mathbf{1}|^2+2\cdot 0+|y-\hat y|^2\\
&=|\hat y-\bar y\mathbf{1}|^2+|y-\hat y|^2.
\end{align*}
Written componentwise, this is exactly
\begin{align*}
\sum_{i=1}^n(y_i-\bar y)^2
=
\sum_{i=1}^n(\hat y_i-\bar y)^2
+
\sum_{i=1}^n(y_i-\hat y_i)^2,
\end{align*}
so $TSS=ESS+RSS$.
Now compare the model using only education with the model using both education and experience. The education-only design row is $(1,e_i)$, while the larger design row is $(1,e_i,r_i)$. For any education-only coefficients $(a,b)^\top$,
\begin{align*}
(1,e_i,r_i)
\begin{pmatrix}
a\\
b\\
0
\end{pmatrix}
&=a+be_i+0\cdot r_i\\
&=a+be_i\\
&=(1,e_i)
\begin{pmatrix}
a\\
b
\end{pmatrix}.
\end{align*}
Thus every fitted vector available in the education-only model is also available in the larger model by setting the experience coefficient equal to $0$. Therefore
\begin{align*}
RSS_{\mathrm{educ+exper}}
&=\min_{a,b,c}\sum_{i=1}^n(y_i-a-be_i-cr_i)^2\\
&\le \min_{a,b}\sum_{i=1}^n(y_i-a-be_i-0\cdot r_i)^2\\
&=\min_{a,b}\sum_{i=1}^n(y_i-a-be_i)^2\\
&=RSS_{\mathrm{educ}}.
\end{align*}
Adding experience can leave the residual sum of squares unchanged or reduce it, but it cannot increase it, because the larger model minimizes over a larger set of fitted vectors.
[/example]
The chapter's main message is that the formulas of ordinary least squares are consequences of projection onto $\operatorname{Range}(X)$. Rank conditions make coefficient estimates unique, the normal equations express residual orthogonality, the hat matrix performs the projection, and the ANOVA identity is the corresponding Pythagorean theorem in an intercept model.
The matrix formulation makes the geometry of least squares explicit: coefficients come from projecting onto $\operatorname{Range}(X)$, and the hat matrix records that projection. With that algebra in hand, we can now ask what these formulas estimate when the data are generated by a random mechanism rather than treated as fixed.
# 3. Probabilistic Assumptions and Estimation Properties
This chapter turns the geometric formula for ordinary least squares into a probabilistic estimator. Chapter 2 explained how $\hat\beta$ is obtained from the projection of $y$ onto $\operatorname{Range}(X)$; here we ask what that projection estimates when the data are generated by a random mechanism. The central questions are: when is $\hat\beta$ unbiased, how variable is it, how should the error variance be estimated, and why should the estimator converge as the sample size grows?
The assumptions in this chapter come in two forms. Fixed-design theory conditions on the observed design matrix and treats randomness as coming only from the error vector. Random-design theory treats each row of the design matrix as part of the random sample, which is the setting needed for consistency arguments based on laws of large numbers.
## Fixed and Random Designs
Before proving properties of $\hat\beta$, we must decide what is random. The same matrix formula $\hat\beta=(X^\top X)^{-1}X^\top y$ can describe a controlled experiment, where $X$ is chosen in advance, or an observational sample, where covariates arrive randomly with the response. These two viewpoints lead to different assumptions and different conditional statements.
[definition: Fixed-Design Linear Model]
A fixed-design linear model consists of a non-random matrix $X\in\mathbb R^{n\times p}$ with rank $p$, a parameter $\beta\in\mathbb R^p$, and a random error vector $\varepsilon\in\mathbb R^n$ such that
\begin{align*}
y=X\beta+\varepsilon.
\end{align*}
[/definition]
In this formulation, all expectations and variances are taken over the distribution of $\varepsilon$. The row $x_i^\top$ of $X$ is interpreted as a fixed collection of covariate values chosen by the analyst or conditioned upon after observation.
Many data sets do not have covariates chosen in advance; the covariate vector is observed as part of the random sample. To state assumptions for that setting, we need a model in which both the response and the design row are random.
[definition: Random-Design Linear Model]
A random-design linear model consists of random pairs $(Y_i,Z_i)_{i=1}^n$, where $Z_i\in\mathbb R^p$, a parameter $\beta\in\mathbb R^p$, and errors $\varepsilon_i$ satisfying
\begin{align*}
Y_i=Z_i^\top\beta+\varepsilon_i,
\end{align*}
with design matrix $X\in\mathbb R^{n\times p}$ whose $i$th row is $Z_i^\top$.
[/definition]
Random design is usually the right model for survey, economic, biological, and observational data. Fixed-design calculations remain useful because many finite-sample results can be stated conditionally on $X$.
[example: Controlled Dose Regression]
In a dose-response experiment, suppose the analyst chooses dose levels $z_1,\dots,z_n$ before observing the outcomes and fits
\begin{align*}
Y_i=\beta_0+\beta_1 z_i+\varepsilon_i.
\end{align*}
The fixed-design matrix is therefore
\begin{align*}
X=
\begin{pmatrix}
1 & z_1\\
1 & z_2\\
\vdots & \vdots\\
1 & z_n
\end{pmatrix},
\qquad
\beta=
\begin{pmatrix}
\beta_0\\
\beta_1
\end{pmatrix},
\qquad
y=X\beta+\varepsilon.
\end{align*}
Here $X$ is treated as non-random: the only random vector is $\varepsilon=(\varepsilon_1,\dots,\varepsilon_n)^\top$.
When $X^\top X$ is invertible, the least-squares estimator is
\begin{align*}
\hat\beta
&=(X^\top X)^{-1}X^\top y\\
&=(X^\top X)^{-1}X^\top(X\beta+\varepsilon)\\
&=(X^\top X)^{-1}X^\top X\beta+(X^\top X)^{-1}X^\top\varepsilon\\
&=\beta+(X^\top X)^{-1}X^\top\varepsilon.
\end{align*}
Thus the intercept and dose-slope estimates vary only through the biological measurement errors. For example, if $\mathbb E[\varepsilon\mid X]=0$, then conditioning on the same chosen doses gives
\begin{align*}
\mathbb E[\hat\beta\mid X]
&=\mathbb E\!\left[\beta+(X^\top X)^{-1}X^\top\varepsilon\mid X\right]\\
&=\beta+(X^\top X)^{-1}X^\top\mathbb E[\varepsilon\mid X]\\
&=\beta.
\end{align*}
So an unbiasedness statement in this setting means: if the experiment were repeated with the same dose levels $z_1,\dots,z_n$, the average least-squares coefficient vector over new error draws would equal $(\beta_0,\beta_1)^\top$.
[/example]
The previous example fixes one test case for the idea. The next example, Observational Wage Regression, changes the setting so the same mechanism can be recognized from another angle.
[example: Observational Wage Regression]
In an observational wage regression, let $Z_i=(1,\operatorname{educ}_i,\operatorname{age}_i,\operatorname{exper}_i)^\top$ and suppose the model is
\begin{align*}
Y_i=Z_i^\top\beta+\varepsilon_i.
\end{align*}
The design matrix is random because its $i$th row is $Z_i^\top$, so
\begin{align*}
X=
\begin{pmatrix}
Z_1^\top\\
Z_2^\top\\
\vdots\\
Z_n^\top
\end{pmatrix},
\qquad
y=
\begin{pmatrix}
Y_1\\
Y_2\\
\vdots\\
Y_n
\end{pmatrix},
\qquad
\varepsilon=
\begin{pmatrix}
\varepsilon_1\\
\varepsilon_2\\
\vdots\\
\varepsilon_n
\end{pmatrix}.
\end{align*}
On the event that $X^\top X$ is invertible, the least-squares estimator satisfies
\begin{align*}
\hat\beta-\beta
&=(X^\top X)^{-1}X^\top y-\beta\\
&=(X^\top X)^{-1}X^\top(X\beta+\varepsilon)-\beta\\
&=(X^\top X)^{-1}X^\top X\beta+(X^\top X)^{-1}X^\top\varepsilon-\beta\\
&=\beta+(X^\top X)^{-1}X^\top\varepsilon-\beta\\
&=(X^\top X)^{-1}X^\top\varepsilon.
\end{align*}
Multiplying numerator and inverse denominator by $n$ rewrites the same identity as
\begin{align*}
\hat\beta-\beta
&=\left(n\frac{X^\top X}{n}\right)^{-1}X^\top\varepsilon\\
&=\frac{1}{n}\left(\frac{X^\top X}{n}\right)^{-1}X^\top\varepsilon\\
&=\left(\frac{X^\top X}{n}\right)^{-1}\frac{X^\top\varepsilon}{n}.
\end{align*}
The two empirical pieces are
\begin{align*}
\frac{X^\top X}{n}
&=\frac{1}{n}
\begin{pmatrix}
Z_1 & Z_2 & \cdots & Z_n
\end{pmatrix}
\begin{pmatrix}
Z_1^\top\\
Z_2^\top\\
\vdots\\
Z_n^\top
\end{pmatrix}
=\frac{1}{n}\sum_{i=1}^n Z_iZ_i^\top,
\\
\frac{X^\top\varepsilon}{n}
&=\frac{1}{n}
\begin{pmatrix}
Z_1 & Z_2 & \cdots & Z_n
\end{pmatrix}
\begin{pmatrix}
\varepsilon_1\\
\varepsilon_2\\
\vdots\\
\varepsilon_n
\end{pmatrix}
=\frac{1}{n}\sum_{i=1}^n Z_i\varepsilon_i.
\end{align*}
Thus conditional-on-$X$ calculations treat the realized education, age, and experience values as fixed, while consistency in the observational setting requires the random average $n^{-1}\sum Z_iZ_i^\top$ to stabilize to an invertible population second-moment matrix and the random average $n^{-1}\sum Z_i\varepsilon_i$ to converge to $0$. The wage example is therefore random-design: the estimator depends not only on new wage errors, but also on which covariate profiles enter the sample.
[/example]
## Exogeneity and Error Assumptions
Least squares can always be computed when $X^\top X$ is invertible, but its statistical interpretation depends on how the errors relate to the covariates. The main issue is whether the part of $Y$ not explained by $X\beta$ has mean zero once the relevant covariate information is known.
[definition: Strict Exogeneity]
In the linear model $y=X\beta+\varepsilon$, strict exogeneity means
\begin{align*}
\mathbb E[\varepsilon\mid X]=0.
\end{align*}
[/definition]
Strict exogeneity is the finite-sample condition that removes systematic error after conditioning on the entire design matrix. It is stronger than requiring zero unconditional mean of each error, because it rules out dependence between the covariates and the conditional mean of the error.
Mean-zero errors control bias, but standard error formulae also depend on how much the conditional error variance changes from one observation to another. The next assumption isolates the constant-variance case.
[definition: Homoskedastic Errors]
In the linear model $y=X\beta+\varepsilon$, homoskedastic errors with variance $\sigma^2>0$ satisfy that the conditional second moments $\mathbb E[\varepsilon_i^2\mid X]$ exist and
\begin{align*}
\operatorname{Var}(\varepsilon_i\mid X)=\sigma^2
\end{align*}
for every $i\in\{1,\dots,n\}$.
[/definition]
Homoskedasticity says that the conditional error variance is constant across observations. It is a variance assumption, not a mean assumption, so it is separate from exogeneity.
Constant variances still leave open whether errors move together across observations. To obtain the familiar diagonal covariance matrix, we also need to rule out conditional covariance between distinct errors.
[definition: Conditionally Uncorrelated Errors]
In the linear model $y=X\beta+\varepsilon$, the errors are conditionally uncorrelated given $X$ if the conditional second moments $\mathbb E[\varepsilon_i^2\mid X]$ exist and
\begin{align*}
\operatorname{Cov}(\varepsilon_i,\varepsilon_j\mid X)=0
\end{align*}
for all $i\ne j$.
[/definition]
Together, homoskedasticity and conditional uncorrelatedness give the compact matrix condition
\begin{align*}
\operatorname{Var}(\varepsilon\mid X)=\sigma^2 I_n.
\end{align*}
Independence of errors given $X$ implies conditional uncorrelatedness when the conditional second moments exist, but uncorrelatedness is all that is needed for the basic variance formula.
For exact finite-sample tests, variance information alone is not enough; we need a distributional assumption stable under linear transformations. The Gaussian version of the error assumption supplies that extra structure.
[definition: Normal Errors]
In the linear model $y=X\beta+\varepsilon$, normal errors with variance $\sigma^2$ satisfy
\begin{align*}
\varepsilon\mid X\sim\mathcal N(0,\sigma^2 I_n).
\end{align*}
[/definition]
Normality is not needed for unbiasedness or consistency of OLS. Its main role is finite-sample distribution theory: under normal errors, linear transformations of $\varepsilon$ are normal, which gives exact $t$ and $F$ procedures in later chapters.
[example: Omitted Variable Bias Calculation]
Suppose the true model is
\begin{align*}
Y_i=\beta_0+\beta_1 X_i+\gamma W_i+u_i,
\end{align*}
with $\mathbb E[u_i\mid X_i,W_i]=0$, but we fit the simple regression of $Y_i$ on $1$ and $X_i$. Relative to the fitted simple regression, the omitted part is
\begin{align*}
\varepsilon_i
&=Y_i-\beta_0-\beta_1X_i\\
&=\gamma W_i+u_i.
\end{align*}
Taking the conditional expectation given $X_i$ gives
\begin{align*}
\mathbb E[\varepsilon_i\mid X_i]
&=\mathbb E[\gamma W_i+u_i\mid X_i]\\
&=\gamma\mathbb E[W_i\mid X_i]+\mathbb E[u_i\mid X_i]\\
&=\gamma\mathbb E[W_i\mid X_i]+\mathbb E\!\left[\mathbb E[u_i\mid X_i,W_i]\mid X_i\right]\\
&=\gamma\mathbb E[W_i\mid X_i].
\end{align*}
Thus if $\gamma\ne0$ and $\mathbb E[W_i\mid X_i]$ changes with $X_i$, the fitted error has nonzero conditional mean as a function of the included covariate, so the exogeneity condition for the simple regression fails.
Assume $\operatorname{Var}(X_i)>0$. The population simple-regression slope is
\begin{align*}
\frac{\operatorname{Cov}(X_i,Y_i)}{\operatorname{Var}(X_i)}.
\end{align*}
Substituting the true model into the numerator,
\begin{align*}
\operatorname{Cov}(X_i,Y_i)
&=\operatorname{Cov}(X_i,\beta_0+\beta_1X_i+\gamma W_i+u_i)\\
&=\operatorname{Cov}(X_i,\beta_0)+\beta_1\operatorname{Cov}(X_i,X_i)+\gamma\operatorname{Cov}(X_i,W_i)+\operatorname{Cov}(X_i,u_i)\\
&=0+\beta_1\operatorname{Var}(X_i)+\gamma\operatorname{Cov}(X_i,W_i)+\operatorname{Cov}(X_i,u_i).
\end{align*}
The last covariance is zero because
\begin{align*}
\mathbb E[X_iu_i]
&=\mathbb E\!\left[X_i\mathbb E[u_i\mid X_i]\right]\\
&=\mathbb E\!\left[X_i\mathbb E\!\left[\mathbb E[u_i\mid X_i,W_i]\mid X_i\right]\right]\\
&=0,
\end{align*}
and also
\begin{align*}
\mathbb E[u_i]
&=\mathbb E\!\left[\mathbb E[u_i\mid X_i,W_i]\right]
=0.
\end{align*}
Therefore $\operatorname{Cov}(X_i,u_i)=\mathbb E[X_iu_i]-\mathbb E[X_i]\mathbb E[u_i]=0$, and the population fitted slope is
\begin{align*}
\frac{\operatorname{Cov}(X_i,Y_i)}{\operatorname{Var}(X_i)}
&=\frac{\beta_1\operatorname{Var}(X_i)+\gamma\operatorname{Cov}(X_i,W_i)}{\operatorname{Var}(X_i)}\\
&=\beta_1+\gamma\frac{\operatorname{Cov}(X_i,W_i)}{\operatorname{Var}(X_i)}.
\end{align*}
So the omitted-variable bias is
\begin{align*}
\gamma\frac{\operatorname{Cov}(X_i,W_i)}{\operatorname{Var}(X_i)},
\end{align*}
the effect of the omitted variable multiplied by its linear association with the included covariate.
[/example]
Mean exogeneity and error independence are logically separate. The omitted-variable example shows a failure of the conditional mean condition; the next example keeps the conditional mean idea in the background but shows how dependence across observations breaks the usual covariance calculation.
[example: Repeated-Measure Violation of Independence]
Suppose each subject $j$ is measured at times $t=1,\dots,m$, and the model is
\begin{align*}
Y_{jt}=Z_{jt}^\top\beta+a_j+u_{jt},
\end{align*}
where $a_j$ is an unobserved subject effect. If the fitted model omits $a_j$, then its error for observation $(j,t)$ is
\begin{align*}
\varepsilon_{jt}=a_j+u_{jt}.
\end{align*}
For two different times $t\ne s$ on the same subject, assume the idiosyncratic errors are conditionally uncorrelated with each other and with the subject effect:
\begin{align*}
\operatorname{Cov}(u_{jt},u_{js}\mid X)=0,\qquad
\operatorname{Cov}(a_j,u_{jt}\mid X)=0,\qquad
\operatorname{Cov}(a_j,u_{js}\mid X)=0.
\end{align*}
Then, using bilinearity of conditional covariance,
\begin{align*}
\operatorname{Cov}(\varepsilon_{jt},\varepsilon_{js}\mid X)
&=\operatorname{Cov}(a_j+u_{jt},a_j+u_{js}\mid X)\\
&=\operatorname{Cov}(a_j,a_j\mid X)
+\operatorname{Cov}(a_j,u_{js}\mid X)
+\operatorname{Cov}(u_{jt},a_j\mid X)
+\operatorname{Cov}(u_{jt},u_{js}\mid X)\\
&=\operatorname{Var}(a_j\mid X)+0+0+0\\
&=\operatorname{Var}(a_j\mid X).
\end{align*}
Thus, whenever $\operatorname{Var}(a_j\mid X)>0$, two observations from the same subject have positive conditional covariance even though their idiosyncratic errors are conditionally uncorrelated. The usual independence, and even the conditionally uncorrelated error assumption, fails within subject; repeated-measure data therefore require clustered or mixed-model methods rather than the simplest OLS standard errors.
[/example]
## Unbiasedness of Ordinary Least Squares
The first finite-sample question is whether the average value of the estimator equals the parameter it is supposed to estimate. Since the OLS error can be written exactly as a linear transformation of $\varepsilon$, unbiasedness is a direct consequence of strict exogeneity.
[quotetheorem:4435]
[citeproof:4435]
This theorem says that OLS gets the mean right under the correct conditional mean assumption. The rank assumption is needed so that the least-squares coefficient vector is uniquely defined; if two columns of $X$ are collinear, the same fitted values can correspond to multiple coefficient vectors. The strict exogeneity assumption is doing the statistical work: if an omitted variable has nonzero conditional mean given $X$, the estimator is centred at the coefficient of the best conditional projection rather than at the structural parameter $\beta$. The theorem does not say that any particular dataset produces a coefficient close to $\beta$; that requires variance control and, for large samples, consistency. It also does not justify the usual standard errors, which need separate second-moment assumptions developed next.
[remark: Unbiasedness Is Not Robust To Mean Misspecification]
If $\mathbb E[\varepsilon\mid X]\ne0$, then
\begin{align*}
\mathbb E[\hat\beta\mid X]-\beta=(X^\top X)^{-1}X^\top\mathbb E[\varepsilon\mid X].
\end{align*}
The bias is therefore the projection of the conditional mean error onto the columns of $X$.
[/remark]
## Variance and Covariance of the OLS Estimator
After establishing that the estimator is centred at $\beta$, the next question is how much it varies across repeated samples. The answer depends on the covariance structure of the errors, and the familiar simple formula requires homoskedastic uncorrelated errors. If the errors have larger variance at high-leverage observations, or if neighbouring observations have positive conditional covariance, then the matrix $\operatorname{Var}(\varepsilon\mid X)$ is not $\sigma^2I_n$ and the simplification to $\sigma^2(X^\top X)^{-1}$ fails. The theorem therefore records not merely a convenient calculation, but the precise covariance assumptions under which the classical standard-error formula is valid.
[quotetheorem:4436]
[citeproof:4436]
The diagonal entries of $\sigma^2(X^\top X)^{-1}$ are the conditional variances of the coefficient estimates. The full-rank hypothesis is again essential, because a singular $X^\top X$ means at least one coefficient direction is not identified by the design and its variance cannot be represented by this inverse matrix. The condition $\operatorname{Var}(\varepsilon\mid X)=\sigma^2I_n$ rules out both heteroskedasticity and conditional correlation; if either is present, the estimator may remain unbiased while this covariance formula becomes wrong. The theorem does not claim that the estimated standard errors are correct, since $\sigma^2$ is still unknown and must be estimated. The off-diagonal entries are their conditional covariances, which matter when testing linear combinations of coefficients, and this is the matrix that later feeds directly into confidence intervals, $t$-statistics, and $F$-tests.
[example: Simple Regression Slope Variance]
For simple regression with intercept and scalar covariate values $x_1,\dots,x_n$, write
\begin{align*}
X=
\begin{pmatrix}
1 & x_1\\
1 & x_2\\
\vdots & \vdots\\
1 & x_n
\end{pmatrix},
\qquad
\bar{x}=\frac{1}{n}\sum_{i=1}^n x_i.
\end{align*}
Assume the design is not constant, so $\sum_{i=1}^n(x_i-\bar{x})^2>0$ and $X^\top X$ is invertible. Under homoskedastic uncorrelated errors, *Conditional Variance Formula For OLS* gives
\begin{align*}
\operatorname{Var}(\hat\beta\mid X)=\sigma^2(X^\top X)^{-1}.
\end{align*}
Now
\begin{align*}
X^\top X
&=
\begin{pmatrix}
1 & 1 & \cdots & 1\\
x_1 & x_2 & \cdots & x_n
\end{pmatrix}
\begin{pmatrix}
1 & x_1\\
1 & x_2\\
\vdots & \vdots\\
1 & x_n
\end{pmatrix} \\
&=
\begin{pmatrix}
n & \sum_{i=1}^n x_i\\
\sum_{i=1}^n x_i & \sum_{i=1}^n x_i^2
\end{pmatrix}.
\end{align*}
For a matrix $\begin{pmatrix}a&b\\ b&c\end{pmatrix}$ with determinant $ac-b^2\ne0$, its inverse is
\begin{align*}
\begin{pmatrix}a&b\\ b&c\end{pmatrix}^{-1}
=
\frac{1}{ac-b^2}
\begin{pmatrix}
c & -b\\
-b & a
\end{pmatrix}.
\end{align*}
Applying this formula with $a=n$, $b=\sum_{i=1}^n x_i$, and $c=\sum_{i=1}^n x_i^2$ gives
\begin{align*}
(X^\top X)^{-1}
=
\frac{1}{n\sum_{i=1}^n x_i^2-\left(\sum_{i=1}^n x_i\right)^2}
\begin{pmatrix}
\sum_{i=1}^n x_i^2 & -\sum_{i=1}^n x_i\\
-\sum_{i=1}^n x_i & n
\end{pmatrix}.
\end{align*}
The slope is the second component of $\hat\beta$, so its conditional variance is $\sigma^2$ times the lower-right entry of $(X^\top X)^{-1}$:
\begin{align*}
\operatorname{Var}(\hat\beta_1\mid X)
&=\sigma^2\frac{n}{n\sum_{i=1}^n x_i^2-\left(\sum_{i=1}^n x_i\right)^2}.
\end{align*}
It remains to rewrite the denominator in centered form. Since $\sum_{i=1}^n x_i=n\bar{x}$,
\begin{align*}
\sum_{i=1}^n(x_i-\bar{x})^2
&=\sum_{i=1}^n\left(x_i^2-2\bar{x}x_i+\bar{x}^2\right)\\
&=\sum_{i=1}^n x_i^2-2\bar{x}\sum_{i=1}^n x_i+n\bar{x}^2\\
&=\sum_{i=1}^n x_i^2-2n\bar{x}^2+n\bar{x}^2\\
&=\sum_{i=1}^n x_i^2-n\bar{x}^2\\
&=\sum_{i=1}^n x_i^2-\frac{\left(\sum_{i=1}^n x_i\right)^2}{n}.
\end{align*}
Multiplying both sides by $n$ gives
\begin{align*}
n\sum_{i=1}^n(x_i-\bar{x})^2
&=n\sum_{i=1}^n x_i^2-\left(\sum_{i=1}^n x_i\right)^2.
\end{align*}
Therefore
\begin{align*}
\operatorname{Var}(\hat\beta_1\mid X)
&=\sigma^2\frac{n}{n\sum_{i=1}^n(x_i-\bar{x})^2}\\
&=\frac{\sigma^2}{\sum_{i=1}^n(x_i-\bar{x})^2}.
\end{align*}
Thus the slope variance is smaller when the covariate values have larger centered sum of squares, because more spread in the design makes changes in the mean response per unit change in $x$ easier to distinguish from noise.
[/example]
The example also points to a boundary of the idea. The following remark, Heteroskedastic Errors, records that interpretation before the construction is used again.
[remark: Heteroskedastic Errors]
If $\operatorname{Var}(\varepsilon\mid X)=\Omega$ for a non-scalar covariance matrix $\Omega$, then
\begin{align*}
\operatorname{Var}(\hat\beta\mid X)=(X^\top X)^{-1}X^\top\Omega X(X^\top X)^{-1}.
\end{align*}
This is the starting point for heteroskedasticity-robust covariance estimation.
[/remark]
## Estimating the Error Variance
The formula for $\operatorname{Var}(\hat\beta\mid X)$ contains the unknown number $\sigma^2$. A first attempt would divide the residual sum of squares by $n$, but this ignores that the residuals are computed after choosing $p$ coefficients to fit the same data. In the extreme case $p=n$ and $X$ is invertible, the residuals are all zero even when the original errors are not, so division by $n$ would systematically underestimate the noise. The natural estimator therefore uses the residual sum of squares, but it must account for the $p$ fitted coefficients already estimated from the data.
[definition: Residual Sum Of Squares]
For the OLS fit $\hat y=X\hat\beta$, the residual vector is $\hat\varepsilon=y-\hat y$ and the residual sum of squares is
\begin{align*}
\operatorname{RSS}=\hat\varepsilon^\top\hat\varepsilon.
\end{align*}
[/definition]
The residuals are shorter than the original errors because OLS removes the component of $y$ lying in the column space of $X$. The loss of $p$ dimensions is the reason for the denominator $n-p$.
The issue is that this dimension count must control an actual random quantity, not just the shape of a subspace. If RSS has the wrong expected size, then dividing it by $n-p$ would still be an arbitrary correction. The next result verifies that, under the spherical error assumptions, the expected residual sum of squares is exactly $\sigma^2$ times the residual dimension.
[quotetheorem:4437]
[citeproof:4437]
This theorem explains degrees of freedom as a dimension count. The rank condition is needed because the residual-maker matrix then has trace $n-p$; without full rank, the dimension removed by fitting is the rank of $X$, not the formal number of columns. The spherical covariance condition is also doing work: under heteroskedasticity, $\mathbb E[\operatorname{RSS}\mid X]$ depends on the diagonal entries of $M\Omega$ rather than reducing to $\sigma^2(n-p)$. The result does not say that $\hat\sigma^2$ is a good estimator under arbitrary misspecification, nor does it give its sampling distribution without normality. The residuals live in the orthogonal complement of the column space of $X$, whose dimension is $n-p$, and that geometric fact becomes the degrees-of-freedom adjustment in later exact inference.
[remark: Normal Errors And Exact Sampling Distributions]
Under the stronger assumption $\varepsilon\mid X\sim\mathcal N(0,\sigma^2I_n)$, the quadratic form $\operatorname{RSS}/\sigma^2$ has a $\chi^2_{n-p}$ distribution and is independent of $\hat\beta$ conditional on $X$. These facts are used for exact confidence intervals and hypothesis tests.
[/remark]
## Consistency Under Random Design
Unbiasedness is a finite-sample average statement, while consistency asks whether $\hat\beta$ approaches $\beta$ as more data are observed. In random design, the key algebra separates the estimator error into a matrix term $X^\top X/n$ and an empirical average term $X^\top\varepsilon/n$. This separation also shows what can go wrong: if the population second-moment matrix is singular, more data do not identify all coefficient directions; if $\mathbb E[Z_i\varepsilon_i]\ne0$, the empirical regressor-error average converges to a nonzero vector. Consistency is therefore a joint statement about identification and exogeneity, not merely about having a large sample.
[definition: Consistency]
An estimator sequence $\hat\beta_n$ of $\beta\in\mathbb R^p$ is consistent if
\begin{align*}
\hat\beta_n\xrightarrow{\mathbb P}\beta
\end{align*}
as $n\to\infty$.
[/definition]
The definition uses convergence in probability because $\hat\beta_n$ is random for every sample size. The target is a fixed parameter in the correctly specified model.
For OLS, the main obstruction is that the estimator contains an inverse random matrix multiplying a random regressor-error average. A large sample helps only if the design matrix stabilises toward a nonsingular limit and the regressor-error average vanishes. The following result packages these two requirements into conditions that make the inverse well behaved and force the estimation error to disappear.
[quotetheorem:4438]
[citeproof:4438]
The theorem says that the empirical geometry stabilises and the empirical correlation between regressors and errors vanishes. Positive definiteness of $Q$ is the population analogue of the full-rank condition: no nonzero linear combination of covariates has zero variance. If $Q$ is singular, an entire direction of $\beta$ is not identified in the population, so no law of large numbers can rescue the inverse. If $\mathbb E[\varepsilon_i\mid Z_i]=0$ fails in a way that leaves $\mathbb E[Z_i\varepsilon_i]\ne0$, the estimator converges to a biased population projection rather than to $\beta$. The theorem does not provide a rate, a limiting normal distribution, or valid standard errors; those require central limit theorem arguments and covariance estimation developed in later inference chapters.
[example: Simulated Convergence Of A Simple Regression Slope]
Consider simulated data from
\begin{align*}
Y_i=1+2Z_i+\varepsilon_i,
\end{align*}
where $Z_i\sim\mathcal N(0,1)$, $\varepsilon_i\sim\mathcal N(0,1)$, and the two sequences are independent. In the simple regression with an intercept, the fitted slope is
\begin{align*}
\hat\beta_{1,n}
&=\frac{\sum_{i=1}^n (Z_i-\bar Z)(Y_i-\bar Y)}{\sum_{i=1}^n (Z_i-\bar Z)^2},
\end{align*}
provided $\sum_{i=1}^n (Z_i-\bar Z)^2>0$. Since
\begin{align*}
\bar Y
&=\frac{1}{n}\sum_{i=1}^n(1+2Z_i+\varepsilon_i)\\
&=1+2\bar Z+\bar\varepsilon,
\end{align*}
we have
\begin{align*}
Y_i-\bar Y
&=(1+2Z_i+\varepsilon_i)-(1+2\bar Z+\bar\varepsilon)\\
&=2(Z_i-\bar Z)+(\varepsilon_i-\bar\varepsilon).
\end{align*}
Substituting this into the slope formula gives
\begin{align*}
\hat\beta_{1,n}
&=\frac{\sum_{i=1}^n (Z_i-\bar Z)\left(2(Z_i-\bar Z)+(\varepsilon_i-\bar\varepsilon)\right)}{\sum_{i=1}^n (Z_i-\bar Z)^2}\\
&=\frac{2\sum_{i=1}^n (Z_i-\bar Z)^2+\sum_{i=1}^n (Z_i-\bar Z)(\varepsilon_i-\bar\varepsilon)}{\sum_{i=1}^n (Z_i-\bar Z)^2}\\
&=2+\frac{\sum_{i=1}^n (Z_i-\bar Z)(\varepsilon_i-\bar\varepsilon)}{\sum_{i=1}^n (Z_i-\bar Z)^2}.
\end{align*}
The remaining numerator can be rewritten as
\begin{align*}
\sum_{i=1}^n (Z_i-\bar Z)(\varepsilon_i-\bar\varepsilon)
&=\sum_{i=1}^n (Z_i-\bar Z)\varepsilon_i-\bar\varepsilon\sum_{i=1}^n(Z_i-\bar Z)\\
&=\sum_{i=1}^n (Z_i-\bar Z)\varepsilon_i-\bar\varepsilon\left(\sum_{i=1}^nZ_i-n\bar Z\right)\\
&=\sum_{i=1}^n (Z_i-\bar Z)\varepsilon_i\\
&=\sum_{i=1}^n Z_i\varepsilon_i-\bar Z\sum_{i=1}^n\varepsilon_i.
\end{align*}
Dividing numerator and denominator by $n$,
\begin{align*}
\hat\beta_{1,n}-2
&=\frac{n^{-1}\sum_{i=1}^n Z_i\varepsilon_i-\bar Z\,n^{-1}\sum_{i=1}^n\varepsilon_i}{n^{-1}\sum_{i=1}^n (Z_i-\bar Z)^2}.
\end{align*}
Also,
\begin{align*}
\frac{1}{n}\sum_{i=1}^n (Z_i-\bar Z)^2
&=\frac{1}{n}\sum_{i=1}^n\left(Z_i^2-2\bar Z Z_i+\bar Z^2\right)\\
&=\frac{1}{n}\sum_{i=1}^n Z_i^2-2\bar Z\frac{1}{n}\sum_{i=1}^nZ_i+\bar Z^2\\
&=\frac{1}{n}\sum_{i=1}^n Z_i^2-\bar Z^2.
\end{align*}
By the weak law of large numbers,
\begin{align*}
\frac{1}{n}\sum_{i=1}^n Z_i^2\xrightarrow{\mathbb P}\mathbb E[Z_i^2]=1,
\qquad
\bar Z\xrightarrow{\mathbb P}0,
\end{align*}
so the denominator converges in probability to $1$. Independence gives $\mathbb E[Z_i\varepsilon_i]=\mathbb E[Z_i]\mathbb E[\varepsilon_i]=0$, and the weak law of large numbers gives
\begin{align*}
\frac{1}{n}\sum_{i=1}^n Z_i\varepsilon_i\xrightarrow{\mathbb P}0,
\qquad
\frac{1}{n}\sum_{i=1}^n\varepsilon_i\xrightarrow{\mathbb P}0.
\end{align*}
Therefore the numerator converges in probability to $0-0\cdot0=0$, and hence
\begin{align*}
\hat\beta_{1,n}-2\xrightarrow{\mathbb P}0.
\end{align*}
So repeated simulations at larger sample sizes produce slope estimates increasingly concentrated near the true slope $2$, because the centered design variation stabilizes while the empirical regressor-error average vanishes.
[/example]
This example previews asymptotic normality: after consistency establishes that the estimator is near $\beta$, the next refinement studies the scaled error $\sqrt n(\hat\beta_n-\beta)$. The same two empirical pieces appear again in Chapter 6, where the law of large numbers for $X^\top X/n$ is paired with a central limit theorem for $X^\top\varepsilon/\sqrt n$.
[remark: Conditional Exogeneity Versus Strict Exogeneity]
For consistency in the i.i.d. random-design setting, the condition $\mathbb E[\varepsilon_i\mid Z_i]=0$ is usually enough because the proof only needs $\mathbb E[Z_i\varepsilon_i]=0$. For finite-sample conditional unbiasedness, strict exogeneity $\mathbb E[\varepsilon\mid X]=0$ is stronger and rules out feedback from other observations' covariates to a given error.
[/remark]
## What The Assumptions Buy
The assumptions in this chapter have distinct jobs. Strict exogeneity centres the estimator at $\beta$; homoskedastic uncorrelated errors produce the simple covariance matrix; the degrees-of-freedom correction makes $\hat\sigma^2$ unbiased; and random-design moment conditions make empirical averages converge to their population limits.
[explanation: Summary Of Estimation Properties]
The exact decomposition
\begin{align*}
\hat\beta-\beta=(X^\top X)^{-1}X^\top\varepsilon
\end{align*}
is the organising identity for the chapter. If the conditional mean of $\varepsilon$ is zero, the conditional mean of the estimator error is zero. If the conditional covariance of $\varepsilon$ is $\sigma^2I_n$, the conditional covariance of the estimator error is $\sigma^2(X^\top X)^{-1}$. If the design is random and the empirical matrices obey laws of large numbers, the same identity becomes an asymptotic argument for consistency.
These results also indicate what fails when assumptions are violated. Omitted variables create a nonzero conditional mean error; repeated measures create correlated errors; heteroskedasticity changes the covariance matrix; weak or collinear covariates make $X^\top X/n$ unstable. Later inference methods are refinements of this chapter's theme: preserve the estimator when its mean behaviour is right, and repair the variance calculation when the simplest covariance assumptions are too restrictive.
[/explanation]
Once the estimator is viewed probabilistically, its properties depend on the covariance structure of the errors and on how the design behaves asymptotically. The next step is to sharpen those conclusions by asking not just whether OLS is unbiased, but whether it is optimal among linear unbiased estimators.
# 4. The Gauss-Markov Theorem
The preceding chapter established ordinary least squares as a projection estimator and computed its variance under the homoskedastic linear model. This chapter asks a sharper question: among all estimators that are linear in the observed responses and unbiased for a fixed linear target, is OLS the most precise? The Gauss-Markov theorem answers yes under the classical second-moment assumptions, and its proof explains exactly where geometry, unbiasedness, and homoskedasticity enter.
## Linear Unbiased Estimators and BLUE
The first issue is to decide which estimators are being compared. Least squares is not being compared with every conceivable procedure; it is compared with estimators that are linear functions of $Y$ and have the right expectation for every value of the regression parameter.
[definition: Linear Estimator]
Let $Y \in \mathbb R^n$ be the response vector. An estimator $\tilde\theta$ of a scalar target is called linear if there exists a vector $a \in \mathbb R^n$ such that
\begin{align*}
\tilde\theta = a^\top Y.
\end{align*}
More generally, an estimator $\tilde\beta$ of $\beta \in \mathbb R^p$ is called linear if there exists a matrix $A \in \mathbb R^{p \times n}$ such that
\begin{align*}
\tilde\beta = AY.
\end{align*}
[/definition]
The word linear refers to linearity in the data $Y$, not to linearity as a function of the parameter $\beta$. Many nonlinear functions of the covariates still lead to linear estimators once the design matrix has been chosen.
[definition: Unbiased Linear Estimator]
In the linear model
\begin{align*}
Y = X\beta + \varepsilon,
\end{align*}
where $X \in \mathbb R^{n \times p}$ has rank $p$ and $\mathbb E[\varepsilon] = 0$, a linear estimator $AY$ of $\beta$ is unbiased if
\begin{align*}
\mathbb E[AY] = \beta
\end{align*}
for every $\beta \in \mathbb R^p$.
[/definition]
Since $\mathbb E[AY]=AX\beta$, unbiasedness for every $\beta$ is equivalent to the algebraic constraint $AX=I_p$. Thus the collection of unbiased linear estimators is the collection of all left inverses of $X$, applied to $Y$.
Among those left inverses, unbiasedness alone gives no preference: many linear rules can hit the right mean. The real comparison problem is that variance is not a single number for a vector estimator; one rule might be more variable for one linear combination of coefficients and less variable for another. The BLUE criterion resolves this by requiring no larger variance in every coefficient direction.
[definition: BLUE]
Let $\mathcal A = \{A \in \mathbb R^{p \times n}: AX=I_p\}$. A linear unbiased estimator $A_0Y$ of $\beta$ is called the best linear unbiased estimator, or BLUE, if for every $A \in \mathcal A$,
\begin{align*}
\operatorname{Var}(c^\top A_0Y) \le \operatorname{Var}(c^\top AY)
\end{align*}
for every $c \in \mathbb R^p$.
[/definition]
The variance comparison is matrix-valued. Equivalently, $A_0Y$ is BLUE when $\operatorname{Var}(AY)-\operatorname{Var}(A_0Y)$ is positive semidefinite for every unbiased linear competitor $AY$.
[example: Weighted Linear Estimator]
Suppose $p=1$ and $X=1_n$, so the model is $Y_i=\mu+\varepsilon_i$ with $\mathbb E[\varepsilon_i]=0$, $\operatorname{Var}(\varepsilon_i)=\sigma^2$, and $\operatorname{Cov}(\varepsilon_i,\varepsilon_j)=0$ for $i\ne j$. For a linear estimator
\begin{align*}
\tilde\mu=\sum_{i=1}^n w_iY_i,
\end{align*}
its expectation is
\begin{align*}
\mathbb E[\tilde\mu]
&= \mathbb E\left[\sum_{i=1}^n w_iY_i\right] \\
&= \sum_{i=1}^n w_i\mathbb E[Y_i] \\
&= \sum_{i=1}^n w_i\mu \\
&= \mu\sum_{i=1}^n w_i.
\end{align*}
Thus $\tilde\mu$ is unbiased for every $\mu$ exactly when $\sum_{i=1}^n w_i=1$.
Under the uncorrelated homoskedastic error assumptions,
\begin{align*}
\operatorname{Var}(\tilde\mu)
&= \operatorname{Var}\left(\sum_{i=1}^n w_iY_i\right) \\
&= \operatorname{Var}\left(\sum_{i=1}^n w_i\varepsilon_i\right) \\
&= \sum_{i=1}^n w_i^2\operatorname{Var}(\varepsilon_i)
+2\sum_{1\le i<j\le n}w_iw_j\operatorname{Cov}(\varepsilon_i,\varepsilon_j) \\
&= \sum_{i=1}^n w_i^2\sigma^2
+2\sum_{1\le i<j\le n}w_iw_j\cdot 0 \\
&= \sigma^2\sum_{i=1}^n w_i^2.
\end{align*}
If $\sum_{i=1}^n w_i=1$, then
\begin{align*}
0
&\le \sum_{i=1}^n\left(w_i-\frac1n\right)^2 \\
&= \sum_{i=1}^n w_i^2-\frac2n\sum_{i=1}^n w_i+\sum_{i=1}^n\frac1{n^2} \\
&= \sum_{i=1}^n w_i^2-\frac2n+\frac1n \\
&= \sum_{i=1}^n w_i^2-\frac1n.
\end{align*}
Therefore $\sum_{i=1}^n w_i^2\ge 1/n$, with equality when $w_i=1/n$ for every $i$. The equal-weight estimator $\bar Y=n^{-1}\sum_{i=1}^n Y_i$ therefore has variance $\sigma^2/n$ and is the minimum-variance estimator among all linear unbiased estimators of $\mu$.
[/example]
This example is the scalar version of Gauss-Markov. The full theorem replaces the line spanned by $1_n$ with the column space of $X$ and replaces scalar weights with a left inverse of $X$.
## The Gauss-Markov Theorem
Why should OLS win among all unbiased linear estimators? The essential reason is that every competitor can be decomposed into the OLS estimator plus a perturbation that vanishes on the column space of $X$; under homoskedastic uncorrelated errors, that perturbation is orthogonal in variance to the OLS part.
[quotetheorem:1437]
[citeproof:1437]
The theorem is finite-sample: it does not rely on limits, asymptotic approximations, or Gaussian errors. Each hypothesis has a distinct role. If $X$ does not have full column rank, then two different parameter vectors can give the same mean vector $X\beta$, so the whole vector $\beta$ is not identifiable and no unbiased estimator of all components can exist. If the errors are correlated or have unequal variances, the Euclidean projection geometry used by OLS no longer matches the covariance geometry; for instance, with two intercept-only observations of variances $1$ and $100$, the equal-weight average is beaten by an inverse-variance weighted average. If $\mathbb E[\varepsilon]\ne 0$, then the OLS estimator inherits that bias and the comparison among unbiased estimators is no longer the right description. These failures lead naturally to two later refinements: restricting attention to estimable contrasts when rank fails, and replacing OLS by weighted least squares later in this chapter, or by generalized least squares when a full covariance structure is known.
[remark: Meaning of Best]
The word best in BLUE means smallest variance in the positive semidefinite order among linear unbiased estimators. It does not mean smallest mean squared error among biased estimators, and it does not compare OLS with nonlinear estimators.
[/remark]
The next example keeps the comparison class fixed but changes the target from a full regression vector to a treatment contrast. It shows how the same variance-minimisation argument appears in the familiar difference-in-means estimator.
[example: Treatment Contrast in a Balanced Experiment]
Consider a balanced experiment with $m$ observations in treatment group $1$ and $m$ observations in treatment group $0$:
\begin{align*}
Y_{1j} &= \mu_1+\varepsilon_{1j}, & Y_{0j} &= \mu_0+\varepsilon_{0j}, && j=1,\dots,m.
\end{align*}
Assume $\mathbb E[\varepsilon_{gj}]=0$, $\operatorname{Var}(\varepsilon_{gj})=\sigma^2$, and all errors are uncorrelated. The target contrast is $\tau=\mu_1-\mu_0$.
The usual difference in group means is
\begin{align*}
\hat\tau
&= \bar Y_1-\bar Y_0 \\
&= \frac1m\sum_{j=1}^m Y_{1j}-\frac1m\sum_{j=1}^m Y_{0j}.
\end{align*}
Its expectation is
\begin{align*}
\mathbb E[\hat\tau]
&= \frac1m\sum_{j=1}^m \mathbb E[Y_{1j}]
-\frac1m\sum_{j=1}^m \mathbb E[Y_{0j}] \\
&= \frac1m\sum_{j=1}^m \mu_1
-\frac1m\sum_{j=1}^m \mu_0 \\
&= \frac{m\mu_1}{m}-\frac{m\mu_0}{m} \\
&= \mu_1-\mu_0 \\
&= \tau.
\end{align*}
Since constants have zero variance and the errors are uncorrelated,
\begin{align*}
\operatorname{Var}(\hat\tau)
&= \operatorname{Var}\left(
\frac1m\sum_{j=1}^m \varepsilon_{1j}
-\frac1m\sum_{j=1}^m \varepsilon_{0j}
\right) \\
&= \sum_{j=1}^m \left(\frac1m\right)^2\operatorname{Var}(\varepsilon_{1j})
+\sum_{j=1}^m \left(-\frac1m\right)^2\operatorname{Var}(\varepsilon_{0j}) \\
&= \sum_{j=1}^m \frac{\sigma^2}{m^2}
+\sum_{j=1}^m \frac{\sigma^2}{m^2} \\
&= \frac{m\sigma^2}{m^2}+\frac{m\sigma^2}{m^2} \\
&= \frac{2\sigma^2}{m}.
\end{align*}
Now take any linear estimator of the contrast,
\begin{align*}
\tilde\tau=\sum_{j=1}^m a_jY_{1j}+\sum_{j=1}^m b_jY_{0j}.
\end{align*}
Its expectation is
\begin{align*}
\mathbb E[\tilde\tau]
&= \sum_{j=1}^m a_j\mathbb E[Y_{1j}]
+\sum_{j=1}^m b_j\mathbb E[Y_{0j}] \\
&= \sum_{j=1}^m a_j\mu_1+\sum_{j=1}^m b_j\mu_0 \\
&= \mu_1\sum_{j=1}^m a_j+\mu_0\sum_{j=1}^m b_j.
\end{align*}
For this to equal $\mu_1-\mu_0$ for every pair $(\mu_1,\mu_0)$, the coefficients must satisfy
\begin{align*}
\sum_{j=1}^m a_j=1,
\qquad
\sum_{j=1}^m b_j=-1.
\end{align*}
Under these unbiasedness constraints, the variance is
\begin{align*}
\operatorname{Var}(\tilde\tau)
&= \operatorname{Var}\left(\sum_{j=1}^m a_j\varepsilon_{1j}
+\sum_{j=1}^m b_j\varepsilon_{0j}\right) \\
&= \sum_{j=1}^m a_j^2\sigma^2+\sum_{j=1}^m b_j^2\sigma^2 \\
&= \sigma^2\left(\sum_{j=1}^m a_j^2+\sum_{j=1}^m b_j^2\right).
\end{align*}
The treatment weights satisfy
\begin{align*}
0
&\le \sum_{j=1}^m\left(a_j-\frac1m\right)^2 \\
&= \sum_{j=1}^m a_j^2-\frac2m\sum_{j=1}^m a_j+\sum_{j=1}^m\frac1{m^2} \\
&= \sum_{j=1}^m a_j^2-\frac2m+\frac1m \\
&= \sum_{j=1}^m a_j^2-\frac1m,
\end{align*}
so $\sum_{j=1}^m a_j^2\ge 1/m$, with equality exactly when $a_j=1/m$ for every $j$. Similarly,
\begin{align*}
0
&\le \sum_{j=1}^m\left(b_j+\frac1m\right)^2 \\
&= \sum_{j=1}^m b_j^2+\frac2m\sum_{j=1}^m b_j+\sum_{j=1}^m\frac1{m^2} \\
&= \sum_{j=1}^m b_j^2-\frac2m+\frac1m \\
&= \sum_{j=1}^m b_j^2-\frac1m,
\end{align*}
so $\sum_{j=1}^m b_j^2\ge 1/m$, with equality exactly when $b_j=-1/m$ for every $j$. Therefore every linear unbiased estimator has
\begin{align*}
\operatorname{Var}(\tilde\tau)
&= \sigma^2\left(\sum_{j=1}^m a_j^2+\sum_{j=1}^m b_j^2\right) \\
&\ge \sigma^2\left(\frac1m+\frac1m\right) \\
&= \frac{2\sigma^2}{m},
\end{align*}
and equality is attained by $\hat\tau=\bar Y_1-\bar Y_0$. Thus the balanced difference in group means is the minimum-variance estimator among all linear unbiased estimators of the treatment contrast.
[/example]
The treatment example shows why randomisation alone is not the mathematical content of Gauss-Markov. Once the design and second moments are fixed, the conclusion is an algebraic statement about unbiased linear combinations.
## Equality and Uniqueness in the BLUE Comparison
The theorem gives a variance inequality, but it is useful to know when another estimator ties OLS. Equality can occur only when the perturbation has no variance in the target direction being considered.
[quotetheorem:4439]
[citeproof:4439]
This result separates equality for one contrast from equality for the whole vector. The same hypotheses as Gauss-Markov are doing real work here: the proof uses the identity covariance geometry to turn the variance penalty into the squared Euclidean length $|(A-B)^\top c|^2$. With correlated or heteroskedastic errors, a nonzero perturbation can interact with the OLS part through nonvanishing covariance terms, so the equality condition must be written using the covariance matrix rather than ordinary orthogonality. Full rank is also essential because $B=(X^\top X)^{-1}X^\top$ is then a genuine left inverse; without it, equality must be discussed for estimable contrasts instead of the whole parameter vector. The following example makes the contrast-by-contrast nature of equality explicit and prepares for the later shift from Euclidean geometry to covariance-weighted geometry.
[example: A Competitor Tying One Contrast]
Let $B=(X^\top X)^{-1}X^\top$, so the OLS estimator is $BY$. Assume $X$ has full column rank and choose $v\in\mathbb R^n$ with $X^\top v=0$. For $d\in\mathbb R^p$, define
\begin{align*}
A=B+dv^\top .
\end{align*}
Since $v^\top X=(X^\top v)^\top=0^\top$, we have
\begin{align*}
AX
&=(B+dv^\top)X \\
&=BX+dv^\top X \\
&=(X^\top X)^{-1}X^\top X+d0^\top \\
&=I_p+0 \\
&=I_p.
\end{align*}
Therefore $AY$ is linear and unbiased for $\beta$.
Now fix a contrast $c^\top\beta$. Since $A-B=dv^\top$,
\begin{align*}
B(A-B)^\top
&=B(dv^\top)^\top \\
&=Bvd^\top \\
&=(X^\top X)^{-1}X^\top v\,d^\top \\
&=(X^\top X)^{-1}0\,d^\top \\
&=0.
\end{align*}
Using $\operatorname{Var}(Y)=\sigma^2I_n$,
\begin{align*}
\operatorname{Var}(c^\top AY)
&=\operatorname{Var}(c^\top BY+c^\top(A-B)Y) \\
&=\sigma^2 c^\top BB^\top c
+2\sigma^2 c^\top B(A-B)^\top c
+\sigma^2 c^\top(A-B)(A-B)^\top c \\
&=\operatorname{Var}(c^\top BY)
+\sigma^2 c^\top(dv^\top)(vd^\top)c.
\end{align*}
The remaining scalar is
\begin{align*}
c^\top(dv^\top)(vd^\top)c
&=c^\top d\,(v^\top v)\,d^\top c \\
&=(c^\top d)^2 |v|^2.
\end{align*}
Hence
\begin{align*}
\operatorname{Var}(c^\top AY)-\operatorname{Var}(c^\top BY)
=\sigma^2 |v|^2(c^\top d)^2.
\end{align*}
Thus this modified estimator ties OLS for exactly those contrasts with $c^\top d=0$; when $v\ne 0$ and $c^\top d\ne 0$, it has strictly larger variance for that contrast.
[/example]
The equality condition also clarifies why the usual full-rank assumption matters. When $X$ lacks full column rank, $\beta$ is not fully identifiable, and the correct target class becomes estimable linear functions of $\beta$, not the entire parameter vector.
## Consequences and Limitations
What does the theorem not say? Many misuses of Gauss-Markov come from replacing its precise comparison class with a broader one.
[explanation: What Normality Adds]
Gauss-Markov does not assume $\varepsilon \sim \mathcal N(0,\sigma^2 I_n)$. If normality is added, then $\hat\beta$ is normally distributed and exact $t$ and $F$ inference can be derived under the usual residual variance estimator. Without normality, the BLUE statement remains valid, but finite-sample exact distributional inference generally needs additional arguments or asymptotic approximations.
[/explanation]
The theorem is also not a universal decision-theoretic optimality result. Bias can reduce variance enough to lower mean squared error, and nonlinear estimators can exploit structure that lies outside the linear unbiased class.
[example: Biased Shrinkage in One Dimension]
In the intercept-only model $Y_i=\mu+\varepsilon_i$, assume $\mathbb E[\varepsilon_i]=0$, $\operatorname{Var}(\varepsilon_i)=\sigma^2$, and the errors are uncorrelated. The sample mean $\bar Y=n^{-1}\sum_{i=1}^n Y_i$ is BLUE for $\mu$ among linear unbiased estimators, but the biased shrinkage estimator $\tilde\mu_a=a\bar Y$ with $0<a<1$ can have smaller mean squared error for some values of $\mu$.
First,
\begin{align*}
\mathbb E[\bar Y]
&= \mathbb E\left[\frac1n\sum_{i=1}^n Y_i\right] \\
&= \frac1n\sum_{i=1}^n \mathbb E[Y_i] \\
&= \frac1n\sum_{i=1}^n \mu \\
&= \mu,
\end{align*}
and, using uncorrelated errors,
\begin{align*}
\operatorname{Var}(\bar Y)
&= \operatorname{Var}\left(\frac1n\sum_{i=1}^n Y_i\right) \\
&= \operatorname{Var}\left(\frac1n\sum_{i=1}^n \varepsilon_i\right) \\
&= \sum_{i=1}^n \left(\frac1n\right)^2\operatorname{Var}(\varepsilon_i) \\
&= \sum_{i=1}^n \frac{\sigma^2}{n^2} \\
&= \frac{\sigma^2}{n}.
\end{align*}
Thus the sample mean has mean squared error
\begin{align*}
\mathbb E[(\bar Y-\mu)^2]
&= \operatorname{Var}(\bar Y)+(\mathbb E[\bar Y]-\mu)^2 \\
&= \frac{\sigma^2}{n}+0^2 \\
&= \frac{\sigma^2}{n}.
\end{align*}
For $\tilde\mu_a=a\bar Y$,
\begin{align*}
\mathbb E[\tilde\mu_a]
&= \mathbb E[a\bar Y] \\
&= a\mathbb E[\bar Y] \\
&= a\mu,
\end{align*}
so its bias is
\begin{align*}
\mathbb E[\tilde\mu_a]-\mu
&= a\mu-\mu \\
&= -(1-a)\mu.
\end{align*}
Also,
\begin{align*}
\operatorname{Var}(\tilde\mu_a)
&= \operatorname{Var}(a\bar Y) \\
&= a^2\operatorname{Var}(\bar Y) \\
&= a^2\frac{\sigma^2}{n}.
\end{align*}
Therefore
\begin{align*}
\mathbb E[(\tilde\mu_a-\mu)^2]
&= \operatorname{Var}(\tilde\mu_a)+(\mathbb E[\tilde\mu_a]-\mu)^2 \\
&= a^2\frac{\sigma^2}{n}+\bigl(-(1-a)\mu\bigr)^2 \\
&= a^2\frac{\sigma^2}{n}+(1-a)^2\mu^2.
\end{align*}
The shrinkage estimator beats $\bar Y$ in mean squared error exactly when
\begin{align*}
(1-a)^2\mu^2+a^2\frac{\sigma^2}{n}
&< \frac{\sigma^2}{n}.
\end{align*}
Subtracting $a^2\sigma^2/n$ from both sides gives
\begin{align*}
(1-a)^2\mu^2
&< (1-a^2)\frac{\sigma^2}{n}.
\end{align*}
Since $0<a<1$, we have $1-a>0$, and dividing by $1-a$ gives the equivalent condition
\begin{align*}
(1-a)\mu^2
&< (1+a)\frac{\sigma^2}{n}.
\end{align*}
Thus, for any fixed $0<a<1$, the biased estimator has smaller mean squared error whenever
\begin{align*}
\mu^2
&< \frac{1+a}{1-a}\frac{\sigma^2}{n}.
\end{align*}
This does not contradict Gauss-Markov: $\tilde\mu_a$ reduces variance by introducing bias, so it lies outside the class of linear unbiased estimators.
[/example]
This shrinkage example does not contradict Gauss-Markov because it leaves the unbiased class. It is a reminder that changing the decision criterion changes the theorem one should use.
[remark: Conditional Versus Fixed Design]
The usual statement treats $X$ as fixed. If $X$ is random, the same proof applies conditionally on $X$ whenever $X$ has full column rank and $\mathbb E[\varepsilon\mid X]=0$, $\operatorname{Var}(\varepsilon\mid X)=\sigma^2 I_n$. The resulting conditional BLUE statement then implies the corresponding unconditional variance comparison after taking expectations.
[/remark]
These limitations are not defects. They identify the exact role of the theorem in regression: OLS is the canonical linear unbiased estimator under homoskedastic uncorrelated errors, while other modelling choices require other optimality criteria.
## Known Heteroskedasticity and Weighted Least Squares
What changes when the errors are uncorrelated but have known unequal variances? Then ordinary Euclidean projection is no longer the geometry matched to the noise; the right projection weights observations according to their precision.
[quotetheorem:4440]
[citeproof:4440]
The weighted theorem has its own necessary hypotheses. Positive definiteness ensures that $\Omega^{-1}$ exists and that every nonzero linear combination of errors has positive variance; if $\Omega$ is singular, some noiseless linear combinations may exist and the estimator has to be formulated with generalized inverses or by reducing the model. Full rank of $X$ again prevents non-identifiability of $\beta$, since weighting the covariance cannot recover a parameter direction absent from the mean vector $X\beta$. The covariance matrix must also be the one used in the weighting: if $\Omega$ is misspecified, the displayed estimator remains linear and unbiased but need not be best. When $\Omega$ is diagonal with entries $\sigma_i^2$, weighted least squares uses weights proportional to $1/\sigma_i^2$; high-variance observations are not discarded, but they receive less influence because they carry less information about the mean. This is the concrete bridge from ordinary least squares to generalized least squares.
[example: Heteroskedastic Two-Group Model]
Suppose $Y_{gj}=\mu+\varepsilon_{gj}$ for groups $g=1,2$ and observations $j=1,\dots,n_g$, with $\mathbb E[\varepsilon_{gj}]=0$, $\operatorname{Var}(\varepsilon_{gj})=\sigma_g^2$, and all errors uncorrelated. Write
\begin{align*}
\bar Y_g=\frac1{n_g}\sum_{j=1}^{n_g}Y_{gj}.
\end{align*}
Then
\begin{align*}
\mathbb E[\bar Y_g]
&= \frac1{n_g}\sum_{j=1}^{n_g}\mathbb E[Y_{gj}] \\
&= \frac1{n_g}\sum_{j=1}^{n_g}\mu \\
&= \mu,
\end{align*}
and, using uncorrelated errors,
\begin{align*}
\operatorname{Var}(\bar Y_g)
&= \operatorname{Var}\left(\frac1{n_g}\sum_{j=1}^{n_g}\varepsilon_{gj}\right) \\
&= \sum_{j=1}^{n_g}\left(\frac1{n_g}\right)^2\operatorname{Var}(\varepsilon_{gj}) \\
&= \sum_{j=1}^{n_g}\frac{\sigma_g^2}{n_g^2} \\
&= \frac{\sigma_g^2}{n_g}.
\end{align*}
Consider any linear unbiased estimator
\begin{align*}
\tilde\mu=\sum_{j=1}^{n_1}a_jY_{1j}+\sum_{j=1}^{n_2}b_jY_{2j}.
\end{align*}
Its expectation is
\begin{align*}
\mathbb E[\tilde\mu]
&= \sum_{j=1}^{n_1}a_j\mu+\sum_{j=1}^{n_2}b_j\mu \\
&= \mu\left(\sum_{j=1}^{n_1}a_j+\sum_{j=1}^{n_2}b_j\right),
\end{align*}
so unbiasedness for every $\mu$ requires
\begin{align*}
\sum_{j=1}^{n_1}a_j+\sum_{j=1}^{n_2}b_j=1.
\end{align*}
Let $\alpha=\sum_{j=1}^{n_1}a_j$, so $\sum_{j=1}^{n_2}b_j=1-\alpha$. For the first group,
\begin{align*}
0
&\le \sum_{j=1}^{n_1}\left(a_j-\frac{\alpha}{n_1}\right)^2 \\
&= \sum_{j=1}^{n_1}a_j^2-\frac{2\alpha}{n_1}\sum_{j=1}^{n_1}a_j+\sum_{j=1}^{n_1}\frac{\alpha^2}{n_1^2} \\
&= \sum_{j=1}^{n_1}a_j^2-\frac{2\alpha^2}{n_1}+\frac{\alpha^2}{n_1} \\
&= \sum_{j=1}^{n_1}a_j^2-\frac{\alpha^2}{n_1}.
\end{align*}
Thus $\sum_{j=1}^{n_1}a_j^2\ge \alpha^2/n_1$, with equality exactly when $a_j=\alpha/n_1$ for every $j$. Similarly,
\begin{align*}
0
&\le \sum_{j=1}^{n_2}\left(b_j-\frac{1-\alpha}{n_2}\right)^2 \\
&= \sum_{j=1}^{n_2}b_j^2-\frac{2(1-\alpha)}{n_2}\sum_{j=1}^{n_2}b_j+\sum_{j=1}^{n_2}\frac{(1-\alpha)^2}{n_2^2} \\
&= \sum_{j=1}^{n_2}b_j^2-\frac{2(1-\alpha)^2}{n_2}+\frac{(1-\alpha)^2}{n_2} \\
&= \sum_{j=1}^{n_2}b_j^2-\frac{(1-\alpha)^2}{n_2}.
\end{align*}
Therefore
\begin{align*}
\operatorname{Var}(\tilde\mu)
&= \sigma_1^2\sum_{j=1}^{n_1}a_j^2+\sigma_2^2\sum_{j=1}^{n_2}b_j^2 \\
&\ge \frac{\sigma_1^2\alpha^2}{n_1}+\frac{\sigma_2^2(1-\alpha)^2}{n_2}.
\end{align*}
Set $r_1=n_1/\sigma_1^2$ and $r_2=n_2/\sigma_2^2$. Then the lower bound becomes
\begin{align*}
\frac{\alpha^2}{r_1}+\frac{(1-\alpha)^2}{r_2}
&= \frac1{r_1+r_2}
+\frac{r_1+r_2}{r_1r_2}\left(\alpha-\frac{r_1}{r_1+r_2}\right)^2.
\end{align*}
Hence it is minimized when
\begin{align*}
\alpha=\frac{r_1}{r_1+r_2}
=\frac{n_1/\sigma_1^2}{n_1/\sigma_1^2+n_2/\sigma_2^2}.
\end{align*}
The equality conditions give equal weights within each group, so the minimum-variance linear unbiased estimator is
\begin{align*}
\hat\mu
&= \frac{r_1}{r_1+r_2}\bar Y_1+\frac{r_2}{r_1+r_2}\bar Y_2 \\
&= \frac{(n_1/\sigma_1^2)\bar Y_1+(n_2/\sigma_2^2)\bar Y_2}{n_1/\sigma_1^2+n_2/\sigma_2^2}.
\end{align*}
Its variance is
\begin{align*}
\operatorname{Var}(\hat\mu)
&= \frac1{r_1+r_2} \\
&= \left(\frac{n_1}{\sigma_1^2}+\frac{n_2}{\sigma_2^2}\right)^{-1}.
\end{align*}
If $\sigma_1^2=\sigma_2^2=\sigma^2$, then
\begin{align*}
\hat\mu
&= \frac{n_1\bar Y_1+n_2\bar Y_2}{n_1+n_2},
\end{align*}
the ordinary pooled sample mean; if one group has larger variance, its precision $n_g/\sigma_g^2$ is smaller, so its group mean receives less weight.
[/example]
Weighted least squares is therefore not a cosmetic modification of OLS. It is the Gauss-Markov estimator for a different covariance geometry, and it reduces to OLS only when the error covariance is proportional to the identity.
Chapter 4 establishes the efficiency benchmark for linear regression under moment assumptions: among linear unbiased estimators, OLS is best in the Gauss-Markov sense. Exact inference then adds the stronger Gaussian assumption, which turns these variance results into finite-sample distributions for test statistics and confidence intervals.
# 5. Exact Inference Under Normal Errors
Exact inference is the part of linear regression where the Gaussian error assumption pays for itself. Chapters 2--4 gave projection geometry, unbiasedness, variance formulae, and Gauss--Markov optimality for least squares under moment assumptions. In this chapter we add normal errors, which turns those geometric facts into finite-sample distributions for estimators, residual sums of squares, confidence intervals, prediction intervals, and tests of linear restrictions.
The central question is: when can regression inference be carried out without large-sample approximations? Under the model $Y = X\beta + \varepsilon$ with fixed full-rank design matrix $X$ and Gaussian error vector $\varepsilon \sim \mathcal N(0, \sigma^2 I_n)$, least squares estimators are normal, residual sums of squares are scaled chi-squared, and the relevant numerator and denominator terms are independent. This chapter develops those facts from the projection matrices introduced earlier.
This chapter turns the least-squares geometry from the previous chapters into exact finite-sample inference. The new ingredient is not a new estimator, but a stronger distributional assumption on the error vector. Under Gaussian errors, projections of the response vector have tractable distributions and independent orthogonal components, which is what makes exact $t$ and $F$ procedures possible.
## Gaussian Least Squares Distributions
What changes when the errors are normal rather than merely centred and homoskedastic? The answer is that every linear transformation of the data has an exact multivariate normal distribution, and orthogonal projections onto perpendicular subspaces become independent random components. Since fitted values and residuals are orthogonal projections of $Y$, the geometry of least squares becomes distribution theory.
Throughout this chapter, $X \in \mathbb R^{n \times p}$ has rank $p$, the response vector is $Y \in \mathbb R^n$, and
\begin{align*}
Y = X\beta + \varepsilon, \qquad \varepsilon \sim \mathcal N(0, \sigma^2 I_n),
\end{align*}
where $\beta \in \mathbb R^p$ and $\sigma^2 > 0$ are unknown. The least squares estimator is
\begin{align*}
\hat\beta = (X^\top X)^{-1}X^\top Y.
\end{align*}
The fitted vector and residual vector are
\begin{align*}
\hat Y = HY, \qquad e = (I_n-H)Y,
\end{align*}
where $H = X(X^\top X)^{-1}X^\top$ is the hat matrix.
The fitted estimator is a linear transformation of the response vector. Under Gaussian errors, linear transformations remain Gaussian, so the next formal step is to identify the exact mean and covariance of $\hat\beta$ rather than only its first two moments.
[quotetheorem:1442]
[citeproof:1442]
This theorem says that each coefficient estimator is exactly normal around its target, with covariance determined by the design. The full-rank condition is needed so that $(X^\top X)^{-1}$ exists; if the columns of $X$ are linearly dependent, the coefficient vector is not identifiable without imposing extra constraints. The covariance assumption $\sigma^2 I_n$ is also doing real work: heteroskedastic or correlated errors change the covariance matrix of $\hat\beta$ and invalidate the standard errors used below. Without Gaussian errors, the mean and variance formulae still hold under the earlier moment assumptions, but the exact normal distribution of $\hat\beta$ need not hold in finite samples. The only unknown quantity in the exact standard error is therefore $\sigma^2$, so the next step is to estimate $\sigma^2$ from the residuals.
[definition: Residual Sum of Squares]
For a fixed full-rank design matrix $X \in \mathbb R^{n \times p}$, the residual sum of squares is the statistic $\operatorname{RSS}:\mathbb R^n \to \mathbb R$ defined by
\begin{align*}
\operatorname{RSS}(Y) = e^\top e = |Y-X\hat\beta|^2,
\end{align*}
where $\hat\beta=(X^\top X)^{-1}X^\top Y$ and $e=(I_n-H)Y$.
[/definition]
The residual sum of squares measures the squared length of the component of $Y$ orthogonal to the column space of $X$. Since that orthogonal complement has dimension $n-p$, the correct degrees of freedom for estimating $\sigma^2$ are $n-p$.
This motivates turning RSS into a scale estimator rather than using RSS directly. The definition below fixes the degrees-of-freedom adjustment that will appear in standard errors, $t$ statistics, and interval widths.
[definition: Unbiased Error Variance Estimator]
For a fixed full-rank design matrix $X \in \mathbb R^{n \times p}$ with $n>p$, the unbiased error variance estimator is the statistic $\hat\sigma^2:\mathbb R^n \to \mathbb R$ defined by
\begin{align*}
\hat\sigma^2(Y) = \frac{\operatorname{RSS}(Y)}{n-p}.
\end{align*}
[/definition]
The division by $n-p$ is not only a bias correction; it is the dimension of the residual subspace. For inference we need more than the expectation of this statistic: confidence intervals and tests require its sampling distribution under the Gaussian model. The theorem below identifies that distribution and explains why the residual sum of squares behaves like an independent chi-squared quantity with $n-p$ degrees of freedom.
[quotetheorem:4474]
[citeproof:4474]
This result depends on the residual matrix being an orthogonal projection with rank $n-p$. If the error covariance is not spherical, rotating into the residual subspace no longer produces independent equal-variance normal coordinates, so the chi-squared reference distribution may fail. The same projection decomposition also separates the information used for estimating the mean vector from the information used for estimating the noise level, and that separation is the independence statement needed for Studentisation.
For inference about coefficients, it is not enough to know the residual variance estimator has a chi-squared distribution. We also need it to be independent of the coefficient estimator, because the usual $t$ and $F$ pivots divide a normal numerator by an independently estimated noise scale. The following result supplies this separation for the fixed Gaussian linear model.
[quotetheorem:4441]
[citeproof:4441]
This independence is the reason Student $t$ and $F$ distributions appear. Zero covariance alone would not be enough for arbitrary error distributions; Gaussianity upgrades uncorrelated orthogonal projections into independent random vectors. The result does not say that the fitted values and residuals are independent after model selection or after using residual plots to change the design matrix. In the fixed-model setting, however, the numerator of a test statistic comes from the normal distribution of $\hat\beta$, while the denominator comes from the independent chi-squared estimate of $\sigma^2$.
[example: Projection Decomposition in a Two-Parameter Regression]
Let $X$ have columns $\mathbf 1$ and $x$, and assume these two columns are linearly independent, so $\operatorname{rank}(X)=2$. Let $H=X(X^\top X)^{-1}X^\top$ be the orthogonal projection onto $\operatorname{col}(X)$, and let $M=I_n-H$ be the projection onto $\operatorname{col}(X)^\perp$. Since $X\beta \in \operatorname{col}(X)$, we have
\begin{align*}
MX\beta
&=(I_n-H)X\beta \\
&=X\beta-HX\beta \\
&=X\beta-X(X^\top X)^{-1}X^\top X\beta \\
&=X\beta-X\beta \\
&=0.
\end{align*}
Thus the residual vector is
\begin{align*}
e=MY=M(X\beta+\varepsilon)=MX\beta+M\varepsilon=M\varepsilon.
\end{align*}
Choose an orthonormal basis $u_1,u_2,u_3,\dots,u_n$ of $\mathbb R^n$ such that $u_1,u_2$ span $\operatorname{col}(X)$ and $u_3,\dots,u_n$ span $\operatorname{col}(X)^\perp$. If $Q$ is the orthogonal matrix with these basis vectors as columns, then in these coordinates
\begin{align*}
Q^\top H Q
=
\begin{pmatrix}
1&0&0&\cdots&0\\
0&1&0&\cdots&0\\
0&0&0&\cdots&0\\
\vdots&\vdots&\vdots&\ddots&\vdots\\
0&0&0&\cdots&0
\end{pmatrix},
\qquad
Q^\top M Q
=
\begin{pmatrix}
0&0&0&\cdots&0\\
0&0&0&\cdots&0\\
0&0&1&\cdots&0\\
\vdots&\vdots&\vdots&\ddots&\vdots\\
0&0&0&\cdots&1
\end{pmatrix}.
\end{align*}
Write $Z=Q^\top \varepsilon/\sigma$. Because $Q$ is orthogonal and $\varepsilon\sim \mathcal N(0,\sigma^2 I_n)$, we have $Z\sim \mathcal N(0,I_n)$. Also,
\begin{align*}
\frac{\operatorname{RSS}}{\sigma^2}
&=\frac{e^\top e}{\sigma^2} \\
&=\frac{\varepsilon^\top M^\top M\varepsilon}{\sigma^2} \\
&=\frac{\varepsilon^\top M\varepsilon}{\sigma^2} \qquad \text{since } M^\top=M \text{ and } M^2=M\\
&=Z^\top(Q^\top M Q)Z \\
&=Z_3^2+\cdots+Z_n^2.
\end{align*}
The fitted vector depends only on the first two rotated coordinates, while the residual sum of squares is the sum of squares of the remaining $n-2$ standard normal coordinates. This is why the residual degrees of freedom are $n-2$: two directions have been used to fit the intercept and slope, leaving $n-2$ orthogonal error directions.
[/example]
## Tests and Intervals for Individual Coefficients
How do we test whether one explanatory variable matters after adjusting for the others? The coefficient $\beta_j$ measures the conditional linear contribution of the $j$th column of $X$, holding the other columns in the model. Exact inference for this coefficient uses the normal distribution of $\hat\beta_j$ and replaces $\sigma$ by the independent estimate $\hat\sigma$.
Let $c_{jj}$ denote the $j$th diagonal entry of $(X^\top X)^{-1}$. Then
\begin{align*}
\operatorname{Var}(\hat\beta_j) = \sigma^2 c_{jj},
\end{align*}
so the estimated standard error is
\begin{align*}
\operatorname{se}(\hat\beta_j) = \hat\sigma\sqrt{c_{jj}}.
\end{align*}
[quotetheorem:4442]
[citeproof:4442]
If $\sigma$ were known, the corresponding statistic would be standard normal rather than Student $t$. The $t$ distribution appears because the same data are used to estimate $\sigma^2$, and the estimate is independent of the coefficient numerator only under the Gaussian projection argument above. This test concerns the coefficient in the specified linear model; it does not by itself establish causality, protect against omitted variables, or account for choosing the variable after inspecting several alternatives. The two-sided test at level $\alpha$ rejects $H_0$ when
\begin{align*}
|T_j| > t_{n-p,1-\alpha/2},
\end{align*}
where $t_{\nu,q}$ denotes the $q$-quantile of the Student $t$ distribution with $\nu$ degrees of freedom.
[example: Advertising Effect Controlling for Price]
Suppose $Y_i$ is sales for store $i$, $x_{i1}$ is advertising spend, and $x_{i2}$ is price. In the model
\begin{align*}
Y_i=\beta_0+\beta_1x_{i1}+\beta_2x_{i2}+\varepsilon_i,
\end{align*}
the null hypothesis that advertising has no effect after controlling for price is $H_0:\beta_1=0$. By *Student t Statistic for a Regression Coefficient*, the test statistic under $H_0$ is
\begin{align*}
T_1
&=\frac{\hat\beta_1-0}{\operatorname{se}(\hat\beta_1)}\\
&=\frac{2.4}{0.8}\\
&=3.
\end{align*}
Since the residual degrees of freedom are $n-p=27$, the reference distribution under $H_0$ is $t_{27}$. For a two-sided test at the $5\%$ level, a Student $t$ quantile table gives $t_{27,0.975}\approx 2.052$, and
\begin{align*}
|T_1|=3>2.052.
\end{align*}
Thus the data reject $H_0$ at the $5\%$ level, giving evidence that advertising is associated with sales after adjusting for price in this fitted linear model.
[/example]
The same statistic also gives confidence intervals. A confidence interval collects the parameter values that would not be rejected by the corresponding two-sided test. This inversion works because the null value enters the numerator linearly and the reference distribution does not depend on the unknown value of $\beta_j$.
The remaining issue is calibration: a reported interval must specify both its endpoints and the probability statement that justifies them. Without the Gaussian fixed-design distribution, the familiar estimate plus-or-minus critical value expression would be only a heuristic. The next result gives the exact interval obtained by inverting the Studentized coefficient test and states its coverage under the model assumptions.
[quotetheorem:4443]
[citeproof:4443]
These intervals are conditional on the chosen model and the fixed design matrix. The full-rank hypothesis ensures that the diagonal entry $c_{jj}$ is finite; near-collinearity does not invalidate the formula, but it can make the interval very wide. The interval quantifies sampling uncertainty in the coefficient, not uncertainty about whether the model was selected after looking at the data. If the error distribution is non-Gaussian, the same expression is often used as a large-sample approximation, but the exact finite-sample coverage statement is no longer justified by this theorem.
## Mean Response and Prediction Intervals
If the fitted model is used at a new covariate vector, what uncertainty should be reported? There are two different targets: the mean response at the covariate vector, and a future observation at the same covariate vector. The future observation has extra noise, so its interval is wider.
Let $x_0 \in \mathbb R^p$ be a fixed covariate vector, including the intercept component when the model has an intercept. The mean response at $x_0$ is
\begin{align*}
m(x_0)=x_0^\top\beta,
\end{align*}
and its least squares estimator is
\begin{align*}
\hat m(x_0)=x_0^\top\hat\beta.
\end{align*}
[quotetheorem:4444]
[citeproof:4444]
The leverage quantity $x_0^\top(X^\top X)^{-1}x_0$ is the new ingredient compared with a coefficient interval. It measures how well the design supports estimating the mean at the chosen covariate vector. The interval is about the regression surface at $x_0$, so it excludes the fresh observation noise that would appear in a future response.
[example: Confidence Interval for Average Sales at a Given Price]
In the sales model with intercept, advertising, and price, the covariate vector
\begin{align*}
x_0=(1,10,4)^\top
\end{align*}
represents advertising spend $10$ and price $4$. If
\begin{align*}
\hat\beta=(\hat\beta_0,\hat\beta_1,\hat\beta_2)^\top,
\end{align*}
then the fitted mean response at this covariate vector is
\begin{align*}
x_0^\top\hat\beta
&=(1,10,4)
\begin{pmatrix}
\hat\beta_0\\
\hat\beta_1\\
\hat\beta_2
\end{pmatrix}\\
&=1\cdot \hat\beta_0+10\cdot \hat\beta_1+4\cdot \hat\beta_2\\
&=\hat\beta_0+10\hat\beta_1+4\hat\beta_2.
\end{align*}
By *Confidence Interval for a Mean Response*, a $100(1-\alpha)\%$ confidence interval for the average sales at advertising spend $10$ and price $4$ is
\begin{align*}
\hat\beta_0+10\hat\beta_1+4\hat\beta_2
\ \pm\
t_{n-3,1-\alpha/2}\,\hat\sigma
\sqrt{(1,10,4)(X^\top X)^{-1}
\begin{pmatrix}
1\\
10\\
4
\end{pmatrix}}.
\end{align*}
Here $p=3$ because the model estimates an intercept, an advertising slope, and a price slope, so the residual degrees of freedom are $n-p=n-3$. The quantity
\begin{align*}
(1,10,4)(X^\top X)^{-1}
\begin{pmatrix}
1\\
10\\
4
\end{pmatrix}
\end{align*}
is the leverage factor for this covariate vector. When this factor is small, the square-root term is small and the confidence interval is narrow; when this factor is large, the same formula gives a wider interval because the design contains less information about the mean response at that point.
[/example]
For prediction, the target is not the mean $x_0^\top\beta$ but a new response
\begin{align*}
Y_0 = x_0^\top\beta + \varepsilon_0,
\end{align*}
where $\varepsilon_0 \sim \mathcal N(0,\sigma^2)$ is independent of the training errors.
This changes the uncertainty calculation in a concrete way. A confidence interval for the mean response accounts only for uncertainty in estimating $x_0^\top\beta$, whereas a prediction interval must also account for the fresh random error in the new observation itself. We need a formula that displays both sources of uncertainty in the same Gaussian linear-model calculation.
[quotetheorem:4472]
[citeproof:4472]
The independence of the future error $\varepsilon_0$ is essential here. If the future observation is correlated with the training errors, or if its variance differs from $\sigma^2$, the extra $1$ inside the square root is no longer the correct adjustment. This theorem also explains why prediction is intrinsically harder than estimating the conditional mean: even perfect knowledge of $\beta$ would not remove the new error term.
[example: Prediction Interval for One Future Store]
Using the same covariate vector $x_0=(1,10,4)^\top$, the fitted value for the future store is
\begin{align*}
x_0^\top\hat\beta
&=(1,10,4)
\begin{pmatrix}
\hat\beta_0\\
\hat\beta_1\\
\hat\beta_2
\end{pmatrix}\\
&=\hat\beta_0+10\hat\beta_1+4\hat\beta_2.
\end{align*}
By *Prediction Interval for a New Observation*, a $100(1-\alpha)\%$ prediction interval for one future store with advertising spend $10$ and price $4$ is
\begin{align*}
\hat\beta_0+10\hat\beta_1+4\hat\beta_2
\ \pm\
t_{n-3,1-\alpha/2}\,\hat\sigma
\sqrt{1+(1,10,4)(X^\top X)^{-1}
\begin{pmatrix}
1\\
10\\
4
\end{pmatrix}}.
\end{align*}
Here $p=3$, so the residual degrees of freedom are $n-p=n-3$.
The corresponding confidence interval for the mean response at the same covariate vector has standard-error factor
\begin{align*}
\hat\sigma
\sqrt{(1,10,4)(X^\top X)^{-1}
\begin{pmatrix}
1\\
10\\
4
\end{pmatrix}},
\end{align*}
whereas the prediction interval has standard-error factor
\begin{align*}
\hat\sigma
\sqrt{1+(1,10,4)(X^\top X)^{-1}
\begin{pmatrix}
1\\
10\\
4
\end{pmatrix}}.
\end{align*}
Since $(X^\top X)^{-1}$ is positive definite and $x_0\ne 0$, the leverage term
\begin{align*}
(1,10,4)(X^\top X)^{-1}
\begin{pmatrix}
1\\
10\\
4
\end{pmatrix}
\end{align*}
is positive, so adding $1$ makes the prediction standard-error factor larger. The prediction interval is therefore wider because it includes both uncertainty in estimating $x_0^\top\beta$ and the additional future error of the individual store.
[/example]
The distinction between a mean-response interval and a prediction interval is a common source of mistakes. The mean-response interval answers a question about the regression surface, while the prediction interval answers a question about a future random outcome.
## F-Tests for Nested Models and Linear Restrictions
How do we test several coefficients at once? Testing coefficients separately with several $t$-tests does not give the same procedure as testing a joint null hypothesis. The joint test compares the loss of fit imposed by the restrictions with the residual variation left after fitting the unrestricted model.
A nested model is obtained by imposing linear restrictions on the full model. The general form is
\begin{align*}
H_0: R\beta = r,
\end{align*}
where $R \in \mathbb R^{q \times p}$ has rank $q$ and $r \in \mathbb R^q$. The unrestricted model estimates all $p$ components of $\beta$, while the restricted model estimates only those values satisfying the constraints.
[definition: Restricted Residual Sum of Squares]
For a fixed full-rank design matrix $X \in \mathbb R^{n \times p}$ and a restriction $R\beta=r$ with $R \in \mathbb R^{q \times p}$, the restricted residual sum of squares is the statistic $\operatorname{RSS}_0:\mathbb R^n \to \mathbb R$ defined by
\begin{align*}
\operatorname{RSS}_0(Y)=\min\{|Y-Xb|^2 : b \in \mathbb R^p,\ Rb=r\}.
\end{align*}
[/definition]
The unrestricted residual sum of squares is denoted $\operatorname{RSS}_1$, with $\operatorname{RSS}_1 \le \operatorname{RSS}_0$. The difference $\operatorname{RSS}_0-\operatorname{RSS}_1$ measures how much worse the fit becomes when the restrictions are imposed.
A raw increase in residual sum of squares is not yet a test statistic, because it grows with the noise scale and with the number of restrictions being imposed. We need to compare the loss of fit per restriction with the residual variation left in the unrestricted model. The following result gives the resulting $F$ statistic and its exact null distribution in the Gaussian fixed-design model.
[quotetheorem:4445]
[citeproof:4445]
The test rejects for large values of $F$, since large values mean that the restrictions substantially increase the residual sum of squares compared with the residual noise scale. The rank condition on $R$ prevents repeated or redundant restrictions from being counted more than once in the numerator degrees of freedom. The test is joint: it can reject because several individually modest coefficient deviations align in a direction with small estimated variance. This connects the regression test with ANOVA decompositions and likelihood-ratio reasoning, where loss of fit is compared with unexplained residual variation.
[example: Joint Test of Advertising and Price Effects]
In the model
\begin{align*}
Y_i = \beta_0+\beta_1x_{i1}+\beta_2x_{i2}+\varepsilon_i,
\end{align*}
the null hypothesis that neither advertising nor price affects sales is
\begin{align*}
H_0:\beta_1=0,\qquad \beta_2=0.
\end{align*}
Writing $\beta=(\beta_0,\beta_1,\beta_2)^\top$, this joint restriction has the matrix form
\begin{align*}
R\beta
&=
\begin{pmatrix}
0&1&0\\
0&0&1
\end{pmatrix}
\begin{pmatrix}
\beta_0\\
\beta_1\\
\beta_2
\end{pmatrix} \\
&=
\begin{pmatrix}
0\cdot\beta_0+1\cdot\beta_1+0\cdot\beta_2\\
0\cdot\beta_0+0\cdot\beta_1+1\cdot\beta_2
\end{pmatrix} \\
&=
\begin{pmatrix}
\beta_1\\
\beta_2
\end{pmatrix}
=
\begin{pmatrix}
0\\
0
\end{pmatrix}.
\end{align*}
Thus $q=2$, because $R$ has two linearly independent rows.
Under the restriction, the fitted values have the form
\begin{align*}
\hat Y_{0,i}
&=b_0+b_1x_{i1}+b_2x_{i2}
\quad\text{with } b_1=0,\ b_2=0 \\
&=b_0,
\end{align*}
so the restricted model contains only an intercept. The restricted residual sum of squares is therefore
\begin{align*}
\operatorname{RSS}_0
=
\min_{b_0\in\mathbb R}
\sum_{i=1}^n (Y_i-b_0)^2.
\end{align*}
The unrestricted model allows all three coefficients to vary, so
\begin{align*}
\operatorname{RSS}_1
=
\min_{b_0,b_1,b_2\in\mathbb R}
\sum_{i=1}^n (Y_i-b_0-b_1x_{i1}-b_2x_{i2})^2.
\end{align*}
Because the unrestricted minimisation is over a larger set of coefficient triples, every restricted fit is also an admissible unrestricted fit, and hence
\begin{align*}
\operatorname{RSS}_1\le \operatorname{RSS}_0.
\end{align*}
The partial $F$ statistic is
\begin{align*}
F
&=
\frac{(\operatorname{RSS}_0-\operatorname{RSS}_1)/q}{\operatorname{RSS}_1/(n-p)} \\
&=
\frac{(\operatorname{RSS}_0-\operatorname{RSS}_1)/2}{\operatorname{RSS}_1/(n-3)},
\end{align*}
since the full model estimates $p=3$ coefficients: intercept, advertising slope, and price slope. The numerator measures the extra residual sum of squares left by forcing $\beta_1=\beta_2=0$, averaged over the two restrictions. The denominator estimates the error variance from the unrestricted model using $n-3$ residual degrees of freedom. Large values of $F$ mean that adding advertising and price reduces residual variation by more than would be expected relative to the remaining noise scale, so the joint test can reject even when the two individual coefficient tests do not separately reject.
[/example]
The residual-sum-of-squares form is most transparent for nested models obtained by dropping predictors. There is also a direct matrix formula for the same test, useful when restrictions compare coefficients or impose nonzero values rather than merely setting a block of coefficients equal to zero. This formula makes explicit that the numerator is a quadratic form in a Gaussian vector.
[quotetheorem:4446]
[citeproof:4446]
When $q=1$, the partial $F$ test is equivalent to the square of the corresponding two-sided $t$ test. For multiple restrictions, the $F$ test captures joint evidence and accounts for covariance between coefficient estimators.
[example: Testing Equality of Two Slopes]
Suppose the coefficient vector is ordered as $\beta=(\beta_0,\beta_1,\beta_2,\dots,\beta_{p-1})^\top$, where $\beta_0$ is the intercept and $\beta_1,\beta_2$ are the two slopes being compared. The null hypothesis that the two slope effects are equal is
\begin{align*}
H_0:\beta_1-\beta_2=0.
\end{align*}
This is a single linear restriction, so $q=1$, and it can be written as $R\beta=0$ with
\begin{align*}
R=(0,1,-1,0,\dots,0).
\end{align*}
Indeed,
\begin{align*}
R\beta
&=
(0,1,-1,0,\dots,0)
\begin{pmatrix}
\beta_0\\
\beta_1\\
\beta_2\\
\vdots\\
\beta_{p-1}
\end{pmatrix} \\
&=0\cdot\beta_0+1\cdot\beta_1+(-1)\cdot\beta_2+0\cdot\beta_3+\cdots+0\cdot\beta_{p-1}\\
&=\beta_1-\beta_2.
\end{align*}
Let $C=(X^\top X)^{-1}$, and write $C_{jk}$ for the entry corresponding to coefficients $\beta_j$ and $\beta_k$. The estimated contrast is
\begin{align*}
R\hat\beta
&=
(0,1,-1,0,\dots,0)
\begin{pmatrix}
\hat\beta_0\\
\hat\beta_1\\
\hat\beta_2\\
\vdots\\
\hat\beta_{p-1}
\end{pmatrix}\\
&=\hat\beta_1-\hat\beta_2.
\end{align*}
Its variance factor is
\begin{align*}
RCR^\top
&=
(0,1,-1,0,\dots,0)C
\begin{pmatrix}
0\\
1\\
-1\\
0\\
\vdots\\
0
\end{pmatrix}\\
&=1^2C_{11}+1(-1)C_{12}+(-1)1C_{21}+(-1)^2C_{22}\\
&=C_{11}-C_{12}-C_{21}+C_{22}\\
&=C_{11}+C_{22}-2C_{12},
\end{align*}
because $C$ is symmetric. Therefore the Wald statistic for testing $H_0$ is
\begin{align*}
F
&=
\frac{1}{1}
\frac{(R\hat\beta-0)^2}{\hat\sigma^2\,RCR^\top}\\
&=
\frac{(\hat\beta_1-\hat\beta_2)^2}
{\hat\sigma^2\left(C_{11}+C_{22}-2C_{12}\right)}.
\end{align*}
Equivalently,
\begin{align*}
T
=
\frac{\hat\beta_1-\hat\beta_2}
{\hat\sigma\sqrt{C_{11}+C_{22}-2C_{12}}}
\end{align*}
satisfies $F=T^2$ for this one-restriction test. The covariance term $C_{12}$ is essential: the uncertainty in $\hat\beta_1-\hat\beta_2$ is not obtained by looking at the two coefficient intervals separately, because those intervals use the marginal variance factors $C_{11}$ and $C_{22}$ rather than the variance of the difference.
[/example]
## Interpreting Exact Inference
What assumptions support the exact reference distributions? The finite-sample $t$ and $F$ results require the linear mean model, fixed design or conditioning on the observed design, full column rank, independent Gaussian errors with common variance, and no data-dependent model search built into the reported reference distribution. If these assumptions are weakened, the same formulae may still be useful approximations, but their exact finite-sample interpretation is lost.
[remark: Conditional Nature of Regression Inference]
The distributions in this chapter are conditional on the design matrix $X$. If the covariates are random, the same conclusions hold conditionally on $X$ whenever the conditional error distribution is Gaussian with covariance $\sigma^2 I_n$ and $X$ has rank $p$.
[/remark]
The first remark settles one point of interpretation. The following remark, What Normality Adds, records a second boundary case before the notes move on.
[remark: What Normality Adds]
Unbiasedness of $\hat\beta$ and the variance formula for $\hat\beta$ do not require Gaussian errors. Normality adds exact finite-sample distributions and independence between fitted and residual components. This is why the formulas in this chapter are sharper than large-sample standard error calculations.
[/remark]
The chapter can be summarised as a chain of pivots. Normality gives $\hat\beta$ a normal distribution; projection geometry gives $\operatorname{RSS}/\sigma^2$ a chi-squared distribution with $n-p$ degrees of freedom; orthogonality plus Gaussianity gives independence; and those facts combine into Student $t$ and $F$ statistics for exact inference under normal errors.
Under normal errors, the geometry of least squares becomes a full distributional theory: the estimator is normal, the residual sum of squares is chi-squared, and these pieces combine into exact $t$ and $F$ procedures. When the Gaussian model is too strong, we need asymptotic arguments and robust covariance estimates instead.
# 6. Large-Sample Inference and Robust Standard Errors
Large-sample inference asks what remains valid when the finite-sample Gaussian model is no longer a credible description of the data. Chapter 5 used exact formulas under Gaussian homoskedastic independent errors, but applied regression often has random covariates, unequal conditional variances, and grouped dependence. The central idea of this chapter is that OLS can still be approximately normal, while the covariance matrix used for inference must match the sampling structure.
## Asymptotic Normality Under Random Design
The first problem is to understand why least squares can have a normal approximation even when the responses are not normally distributed. In random-design regression, the rows $(Y_i, X_i)$ are random observations, so the design matrix is part of the sampling process rather than a fixed object. The OLS estimator is therefore a statistic built from sample averages, and its limiting behaviour is governed by laws of large numbers and central limit theorems.
[definition: Random Design Linear Regression]
Let $(Y_i, X_i)_{i=1}^n$ be i.i.d. random vectors with $Y_i \in \mathbb R$ and $X_i \in \mathbb R^p$. A random design linear regression model specifies
\begin{align*}
Y_i = X_i^\top \beta_0 + u_i,
\end{align*}
where $\beta_0 \in \mathbb R^p$ and $u_i = Y_i - X_i^\top \beta_0$ satisfies $\mathbb E[X_i u_i] = 0$.
[/definition]
The condition $\mathbb E[X_i u_i] = 0$ is the population orthogonality condition. It says that the linear predictor $x \mapsto x^\top \beta_0$ solves the population normal equations, whether or not the full conditional mean $\mathbb E[Y_i \mid X_i]$ is exactly linear.
At the sample level, the unknown population equations cannot be solved directly, so we need the empirical rule that replaces expectations by observed averages. This is the random-design version of OLS: it treats the covariate rows as sampled data and chooses the coefficient vector that minimises the observed squared prediction errors.
[definition: Random Design OLS Estimator]
Given observations $(Y_i, X_i)_{i=1}^n$ with $X_i \in \mathbb R^p$, the ordinary least squares estimator is any minimiser
\begin{align*}
\hat\beta_n \in \operatorname*{argmin}_{b \in \mathbb R^p} \sum_{i=1}^n (Y_i - X_i^\top b)^2.
\end{align*}
When $\sum_{i=1}^n X_iX_i^\top$ is invertible, it is
\begin{align*}
\hat\beta_n = \left(\sum_{i=1}^n X_iX_i^\top\right)^{-1}\sum_{i=1}^n X_iY_i.
\end{align*}
[/definition]
The estimator has a useful decomposition obtained by substituting $Y_i=X_i^\top\beta_0+u_i$:
\begin{align*}
\hat\beta_n - \beta_0
= \left(\frac{1}{n}\sum_{i=1}^n X_iX_i^\top\right)^{-1}
\left(\frac{1}{n}\sum_{i=1}^n X_i u_i\right).
\end{align*}
Thus consistency comes from the second factor converging to $0$, while asymptotic normality comes from scaling that factor by $\sqrt n$.
The decomposition also shows the two possible failures: the empirical second-moment matrix may approach a singular limit, or the empirical regressor-error average may not vanish. A consistency theorem must rule out both failures at once. The next result states conditions that make the curvature invertible in the limit and make the sample moment converge to its population value zero.
[quotetheorem:4447]
[citeproof:4447]
The orthogonality assumption is essential: if $\mathbb E[X_i u_i]\ne 0$, the sample normal equations converge to the wrong population equations and OLS targets a pseudo-parameter rather than $\beta_0$. Positive definiteness of $Q$ rules out exact population collinearity; for instance, if one regressor is a deterministic linear combination of the others, the coefficient vector is not identified. The theorem proves convergence of the coefficient only, not valid confidence intervals, so the next step is to identify the limiting distribution and its covariance.
[definition: Meat Matrix]
In random design regression with moment condition $\mathbb E[X_i u_i]=0$, define
\begin{align*}
\Omega := \mathbb E[u_i^2 X_iX_i^\top].
\end{align*}
[/definition]
The matrix $Q$ measures the curvature of the least-squares objective, while $\Omega$ measures the variance of the score contribution $X_i u_i$. If errors are conditionally homoskedastic with variance $\sigma^2$, then $\Omega = \sigma^2 Q$; otherwise these matrices are not proportional.
Consistency alone does not say how large the remaining estimation error is at sample size $n$. For inference, we need the scale and shape of the limiting fluctuation of $\sqrt n(\hat\beta_n-\beta_0)$. The obstacle is that random design combines curvature from $Q$ with score variability from $\Omega$, so the limiting covariance must account for both pieces.
[quotetheorem:4448]
[citeproof:4448]
The finite second moment condition is doing real work: if $X_i u_i$ has infinite variance, the classical central limit theorem may fail and a normal approximation can be misleading. Fixed $p$ also matters, because high-dimensional regression requires different arguments and different covariance corrections. The theorem gives the limiting law under the target moment condition, but it does not yet tell us how to estimate $\Omega$ from data when $u_i$ is unobserved.
[example: Heteroskedastic Wage Regression]
Suppose $Y_i$ is log wage and $X_i=(1,\operatorname{educ}_i,\operatorname{exper}_i)^\top$, with
\begin{align*}
Y_i=X_i^\top\beta_0+u_i,
\qquad
\mathbb E[X_i u_i]=0.
\end{align*}
Write the conditional error variance as
\begin{align*}
\sigma^2(X_i):=\mathbb E[u_i^2\mid X_i].
\end{align*}
Then the meat matrix in the asymptotic covariance is
\begin{align*}
\Omega
&=\mathbb E[u_i^2X_iX_i^\top] \\
&=\mathbb E[\mathbb E[u_i^2X_iX_i^\top\mid X_i]] \\
&=\mathbb E[X_iX_i^\top\mathbb E[u_i^2\mid X_i]] \\
&=\mathbb E[\sigma^2(X_i)X_iX_i^\top].
\end{align*}
If the classical homoskedastic formula were valid, there would be a constant $\sigma^2$ such that
\begin{align*}
\Omega
=\mathbb E[\sigma^2 X_iX_i^\top]
=\sigma^2\mathbb E[X_iX_i^\top]
=\sigma^2 Q.
\end{align*}
But when high-education workers have larger conditional wage dispersion, $\sigma^2(X_i)$ changes with $\operatorname{educ}_i$, so $\mathbb E[\sigma^2(X_i)X_iX_i^\top]$ is not generally equal to $\sigma^2Q$ for any common variance $\sigma^2$.
The OLS coefficient is still
\begin{align*}
\hat\beta_n
=\left(\sum_{i=1}^n X_iX_i^\top\right)^{-1}\sum_{i=1}^n X_iY_i,
\end{align*}
so the fitted slope on education is not changed by using robust inference. What changes is the covariance estimate: the sandwich formula estimates $Q^{-1}\Omega Q^{-1}$ with an empirical middle matrix built from $\hat u_i^2X_iX_i^\top$, so the reported education standard error reflects the larger residual variation attached to high-education observations.
[/example]
This example separates two questions that are often conflated. The OLS coefficient targets a best linear prediction parameter under $\mathbb E[X_i u_i]=0$, while the standard error estimates how variable that coefficient is across repeated samples.
## Heteroskedasticity-Consistent Covariance Estimation
The next problem is practical: the asymptotic covariance $Q^{-1}\Omega Q^{-1}$ contains unknown population moments and unobserved errors. A covariance estimator must replace these objects by sample analogues using residuals.
[definition: OLS Residual]
For the OLS estimator $\hat\beta_n$, the residual for observation $i$ is
\begin{align*}
\hat u_i := Y_i - X_i^\top\hat\beta_n.
\end{align*}
[/definition]
Residuals are not the same as the true errors, but under consistency they are close enough for estimating average second moments in many large-sample settings. The basic heteroskedasticity-consistent estimator replaces $u_i^2X_iX_i^\top$ by $\hat u_i^2X_iX_i^\top$.
The practical problem is to turn the asymptotic covariance formula into a computable matrix using only the fitted regression output. HC0 does this by keeping the exact OLS curvature matrices from the sample and estimating the middle second-moment matrix with squared residuals, allowing different observations to have different error variances.
[definition: HC0 Covariance Estimator]
Let $X$ be the $n \times p$ design matrix with row $X_i^\top$, and let $\hat u_i=Y_i-X_i^\top\hat\beta_n$. The HC0 covariance estimator for $\hat\beta_n$ is
\begin{align*}
\widehat{\operatorname{Var}}_{\mathrm{HC0}}(\hat\beta_n)
:= (X^\top X)^{-1}\left(\sum_{i=1}^n \hat u_i^2 X_iX_i^\top\right)(X^\top X)^{-1}.
\end{align*}
[/definition]
The formula is often called a sandwich estimator: the middle matrix estimates score variability, and the two outer matrices estimate the inverse curvature of the least-squares criterion.
This name is useful because the same structure appears beyond OLS, whenever an estimator is defined by empirical moment equations. Separating curvature from score variability helps identify which part changes under heteroskedasticity, clustering, or other dependence corrections. The following definition records that reusable pattern independently of the particular HC0 formula.
[definition: Sandwich Form]
A covariance estimator has sandwich form when it can be written as
\begin{align*}
\widehat V_n = \widehat Q_n^{-1}\widehat\Omega_n\widehat Q_n^{-1},
\end{align*}
where $\widehat Q_n$ estimates the curvature matrix and $\widehat\Omega_n$ estimates the variance of the estimating equation.
[/definition]
For OLS, the covariance of $\sqrt n(\hat\beta_n-\beta_0)$ is estimated by
\begin{align*}
\widehat Q_n^{-1}\widehat\Omega_n\widehat Q_n^{-1},
\qquad
\widehat Q_n = \frac{1}{n}X^\top X,
\qquad
\widehat\Omega_n = \frac{1}{n}\sum_{i=1}^n \hat u_i^2X_iX_i^\top.
\end{align*}
The covariance of $\hat\beta_n$ itself is this matrix divided by $n$, which is algebraically the HC0 expression above.
Because finite samples use residuals after estimating $p$ coefficients, the raw HC0 scale can understate uncertainty. A common next adjustment keeps the same robust sandwich structure but inflates it by a degrees-of-freedom factor.
[definition: HC1 Covariance Estimator]
The HC1 covariance estimator is the degrees-of-freedom adjusted version
\begin{align*}
\widehat{\operatorname{Var}}_{\mathrm{HC1}}(\hat\beta_n)
:= \frac{n}{n-p}\widehat{\operatorname{Var}}_{\mathrm{HC0}}(\hat\beta_n).
\end{align*}
[/definition]
HC1 has the same large-sample limit as HC0 because $n/(n-p)\to 1$ when $p$ is fixed. Its role is a finite-sample adjustment, analogous in spirit to using $n-p$ rather than $n$ when estimating a common error variance.
The remaining question is whether these robust formulas estimate the right sampling variance under heteroskedasticity, not merely whether they are plausible algebraic corrections. The obstruction is that the residuals used in the middle matrix are computed after fitting the model, so they are neither the true errors nor independent of the fitted regressors. A consistency result is needed to justify replacing the unknown heteroskedastic variance structure by residual-based quantities in large samples.
[quotetheorem:4449]
[citeproof:4449]
The fourth-moment assumptions are sufficient regularity conditions, not decorative technicalities: without enough tail control, a few observations with very large residuals or leverage can dominate the middle matrix. The theorem does not say that heteroskedasticity is harmless for prediction or model specification; it says that the covariance estimate for OLS can remain valid under unequal conditional variances. This is the bridge from the population normal approximation to the practical standard errors reported in regression tables.
[example: Classical Versus Robust Confidence Intervals]
Consider the education coefficient in the wage regression, and let $e_{\operatorname{educ}}=(0,1,0)^\top$ select the education slope from $\hat\beta_n$. Since $z_{0.975}\approx 1.96$ for the standard normal distribution, the classical $95\%$ interval is
\begin{align*}
e_{\operatorname{educ}}^\top\hat\beta_n
\pm
1.96\sqrt{e_{\operatorname{educ}}^\top\widehat{\operatorname{Var}}_{\mathrm{classical}}(\hat\beta_n)e_{\operatorname{educ}}},
\end{align*}
where, under the common-variance formula,
\begin{align*}
\widehat{\operatorname{Var}}_{\mathrm{classical}}(\hat\beta_n)
&=\hat\sigma^2(X^\top X)^{-1},\\
\hat\sigma^2
&=\frac{1}{n-p}\sum_{i=1}^n \hat u_i^2.
\end{align*}
Therefore the squared classical standard error for the education slope is
\begin{align*}
\widehat{\operatorname{se}}_{\mathrm{classical}}(\hat\beta_{\operatorname{educ}})^2
&=e_{\operatorname{educ}}^\top\hat\sigma^2(X^\top X)^{-1}e_{\operatorname{educ}}\\
&=\left(\frac{1}{n-p}\sum_{i=1}^n \hat u_i^2\right)
e_{\operatorname{educ}}^\top(X^\top X)^{-1}e_{\operatorname{educ}}.
\end{align*}
The HC1 robust interval uses the same centre $e_{\operatorname{educ}}^\top\hat\beta_n=\hat\beta_{\operatorname{educ}}$, but replaces the standard error by
\begin{align*}
\widehat{\operatorname{se}}_{\mathrm{HC1}}(\hat\beta_{\operatorname{educ}})
=
\sqrt{
e_{\operatorname{educ}}^\top
\widehat{\operatorname{Var}}_{\mathrm{HC1}}(\hat\beta_n)
e_{\operatorname{educ}}
}.
\end{align*}
Using the HC1 formula,
\begin{align*}
\widehat{\operatorname{Var}}_{\mathrm{HC1}}(\hat\beta_n)
&=\frac{n}{n-p}(X^\top X)^{-1}
\left(\sum_{i=1}^n \hat u_i^2X_iX_i^\top\right)
(X^\top X)^{-1},
\end{align*}
so the squared robust standard error is
\begin{align*}
\widehat{\operatorname{se}}_{\mathrm{HC1}}(\hat\beta_{\operatorname{educ}})^2
&=
\frac{n}{n-p}
e_{\operatorname{educ}}^\top(X^\top X)^{-1}
\left(\sum_{i=1}^n \hat u_i^2X_iX_i^\top\right)
(X^\top X)^{-1}e_{\operatorname{educ}}\\
&=
\frac{n}{n-p}
\sum_{i=1}^n
\hat u_i^2
\left(e_{\operatorname{educ}}^\top(X^\top X)^{-1}X_i\right)^2.
\end{align*}
Thus observations with large residual squares $\hat u_i^2$ and large education-direction weights $\left(e_{\operatorname{educ}}^\top(X^\top X)^{-1}X_i\right)^2$ contribute more to the robust education standard error. When high-education observations carry both larger residual variation and high leverage in this direction, the HC1 interval is wider because its half-length is $1.96\,\widehat{\operatorname{se}}_{\mathrm{HC1}}(\hat\beta_{\operatorname{educ}})$ rather than $1.96\,\widehat{\operatorname{se}}_{\mathrm{classical}}(\hat\beta_{\operatorname{educ}})$.
[/example]
A robust interval should be interpreted as large-sample inference under heteroskedasticity, not as a cure for all modelling problems. If the conditional mean is misspecified, the interval is about the best linear projection coefficient rather than a structural causal effect.
## Clustered Standard Errors
The final problem is dependence. Many data sets contain observations that are not independent within groups: students within schools, workers within firms, patients within hospitals, or counties within states. Treating all observations as independent can understate uncertainty because within-group shocks make many rows move together.
[definition: Clustered Data]
Observations are clustered when the sample is partitioned into groups $g=1,\dots,G$, with observations indexed by $i=1,\dots,n_g$ inside group $g$, and dependence is allowed within each group but groups are independent of one another.
[/definition]
The independence unit is now the cluster, not the individual row. As a result, the central limit theorem is applied to sums over clusters.
[definition: Cluster Score]
For a linear regression with regressors $X_{gi}\in\mathbb R^p$ and residuals $u_{gi}$, the population cluster score is
\begin{align*}
S_g := \sum_{i=1}^{n_g} X_{gi}u_{gi}.
\end{align*}
The sample cluster score replaces $u_{gi}$ by $\hat u_{gi}$:
\begin{align*}
\hat S_g := \sum_{i=1}^{n_g} X_{gi}\hat u_{gi}.
\end{align*}
[/definition]
The robust middle matrix must now estimate the variance of $\sum_g S_g$, not the sum of individual variances. Cross-products within a cluster are retained, which is precisely what row-by-row HC estimators omit.
[definition: Cluster Robust Covariance Estimator]
Let $X$ be the full design matrix and let $\hat S_g=\sum_{i=1}^{n_g}X_{gi}\hat u_{gi}$. The cluster robust covariance estimator is
\begin{align*}
\widehat{\operatorname{Var}}_{\mathrm{CR}}(\hat\beta)
:= (X^\top X)^{-1}\left(\sum_{g=1}^G \hat S_g\hat S_g^\top\right)(X^\top X)^{-1}.
\end{align*}
[/definition]
Many software packages multiply this expression by finite-sample correction factors depending on $G$, $n$, and $p$. These corrections do not alter the main asymptotic idea: the middle matrix is built from cluster-level score sums.
To use the cluster formula for standard errors, we need conditions under which the cluster score sums obey a central limit approximation and the displayed matrix consistently estimates their variance. The theorem below makes explicit that the asymptotics are driven by the number of independent clusters.
[quotetheorem:4450]
[citeproof:4450]
The theorem requires many independent clusters, not merely many observations: with five states and thousands of counties, the independent sampling units are still the five states. Positive definiteness of $Q$ rules out cluster-level designs with no usable variation in the regressors, such as a policy indicator that is constant across all clusters. The theorem also does not justify arbitrary clustering choices; it connects the covariance estimator to the sampling approximation in which independent cluster scores satisfy a central limit theorem.
[example: County Policy Data Clustered by State]
Suppose $Y_{gi}$ is unemployment growth for county $i$ in state $g$, and let $D_g$ be a state-level policy indicator included in $X_{gi}$. The fitted residual for county $i$ in state $g$ is $\hat u_{gi}=Y_{gi}-X_{gi}^\top\hat\beta$, so the county score contribution is $X_{gi}\hat u_{gi}$. Clustering by state replaces the row-level middle matrix with the sum of state score outer products:
\begin{align*}
\hat S_g
&=\sum_{i=1}^{n_g}X_{gi}\hat u_{gi},\\
\sum_{g=1}^G \hat S_g\hat S_g^\top
&=\sum_{g=1}^G
\left(\sum_{i=1}^{n_g}X_{gi}\hat u_{gi}\right)
\left(\sum_{j=1}^{n_g}X_{gj}\hat u_{gj}\right)^\top \\
&=\sum_{g=1}^G\sum_{i=1}^{n_g}\sum_{j=1}^{n_g}
X_{gi}\hat u_{gi}\hat u_{gj}X_{gj}^\top.
\end{align*}
The terms with $i=j$ are
\begin{align*}
\sum_{g=1}^G\sum_{i=1}^{n_g}
\hat u_{gi}^2X_{gi}X_{gi}^\top,
\end{align*}
which are the row-level heteroskedasticity terms. The additional terms with $i\ne j$ are
\begin{align*}
\sum_{g=1}^G\sum_{\substack{i,j=1\\ i\ne j}}^{n_g}
X_{gi}\hat u_{gi}\hat u_{gj}X_{gj}^\top,
\end{align*}
and these are exactly the within-state cross-products that allow county errors in the same state to move together.
The cluster-robust covariance estimator is therefore
\begin{align*}
\widehat{\operatorname{Var}}_{\mathrm{CR}}(\hat\beta)
&=(X^\top X)^{-1}
\left(\sum_{g=1}^G\hat S_g\hat S_g^\top\right)
(X^\top X)^{-1}.
\end{align*}
If the policy indicator is constant within state, then $D_{gi}=D_g$ for every county $i$ in state $g$, so the policy part of the cluster score accumulates residuals at the state level:
\begin{align*}
\sum_{i=1}^{n_g}D_{gi}\hat u_{gi}
&=\sum_{i=1}^{n_g}D_g\hat u_{gi}
=D_g\sum_{i=1}^{n_g}\hat u_{gi}.
\end{align*}
Thus clustering treats the independent variation in the policy coefficient as coming from the $G$ state-level score sums, not from the total number of counties. When counties in the same state have positively correlated residuals and policy variation is mainly across states, the within-state cross-products increase the middle matrix, so the state-clustered standard error can be much larger than the row-level HC1 standard error.
[/example]
The example also points to a boundary of the idea. The following remark, Choosing the Clustering Level, records that interpretation before the construction is used again.
[remark: Choosing the Clustering Level]
The clustering level should match the level at which unmodelled shocks are plausibly shared and the level at which treatment or policy variation is assigned. Clustering below the dependence level leaves residual correlation unaccounted for. Clustering above the dependence level may be conservative, but it can also reduce the effective number of clusters.
[/remark]
Clustered standard errors are not a substitute for modelling the dependence structure when the dependence itself is the object of interest. Their purpose here is narrower: to make regression coefficient inference less sensitive to arbitrary within-cluster correlation.
## Wald Inference With Robust Covariance Matrices
The last step is to translate covariance estimators into tests and intervals. Once a consistent covariance estimate is available, the large-sample machinery is the same for classical, heteroskedasticity-robust, and cluster-robust inference.
[definition: Robust Standard Error]
Given a covariance estimator $\widehat{\operatorname{Var}}(\hat\beta)$, the robust standard error for coefficient $\hat\beta_j$ is
\begin{align*}
\widehat{\operatorname{se}}(\hat\beta_j)
:= \sqrt{\left(\widehat{\operatorname{Var}}(\hat\beta)\right)_{jj}}.
\end{align*}
[/definition]
The standard error is the estimated standard deviation of the sampling distribution of the coefficient. Its construction changes with the covariance estimator, but its role in inference stays the same.
Once a coefficient has been paired with an estimated standard error, the practical inference question is whether the resulting studentised statistic has a standard normal limit. The following result records that generic Wald step, separating it from the particular covariance estimator used to build the denominator.
[quotetheorem:4451]
[citeproof:4451]
The assumption of studentised convergence is the substantive input: it can fail if the covariance estimator is inconsistent, if the target linear combination has zero asymptotic variance, or if the relevant central limit theorem is a poor approximation. The result does not say which covariance estimator should be used; that choice is made by the sampling structure, such as heteroskedastic independent rows or many independent clusters. This explains why the same regression table can report classical, HC1, or clustered standard errors while keeping the coefficient column unchanged: the coefficient estimates the target linear parameter, while the standard error estimates uncertainty under a chosen large-sample approximation.
[example: Testing a Policy Coefficient With Clustered Standard Errors]
In the county policy regression, let $a$ be the vector that selects the policy coefficient, so $a^\top\hat\beta=\hat\beta_{\mathrm{policy}}$. To test the null hypothesis $a^\top\beta_0=0$ with state-clustered standard errors, the clustered statistic is
\begin{align*}
t_{\mathrm{CR}}
&=
\frac{a^\top\hat\beta}{\sqrt{a^\top\widehat{\operatorname{Var}}_{\mathrm{CR}}(\hat\beta)a}}.
\end{align*}
Using the cluster-robust covariance formula,
\begin{align*}
a^\top\widehat{\operatorname{Var}}_{\mathrm{CR}}(\hat\beta)a
&=
a^\top
(X^\top X)^{-1}
\left(\sum_{g=1}^G \hat S_g\hat S_g^\top\right)
(X^\top X)^{-1}
a \\
&=
\sum_{g=1}^G
a^\top(X^\top X)^{-1}\hat S_g\hat S_g^\top(X^\top X)^{-1}a \\
&=
\sum_{g=1}^G
\left(a^\top(X^\top X)^{-1}\hat S_g\right)
\left(\hat S_g^\top(X^\top X)^{-1}a\right) \\
&=
\sum_{g=1}^G
\left(a^\top(X^\top X)^{-1}\hat S_g\right)^2.
\end{align*}
Since $\hat S_g=\sum_{i=1}^{n_g}X_{gi}\hat u_{gi}$, each squared cluster contribution expands as
\begin{align*}
\left(a^\top(X^\top X)^{-1}\hat S_g\right)^2
&=
\left(
\sum_{i=1}^{n_g}
a^\top(X^\top X)^{-1}X_{gi}\hat u_{gi}
\right)^2 \\
&=
\sum_{i=1}^{n_g}\sum_{j=1}^{n_g}
\left(a^\top(X^\top X)^{-1}X_{gi}\right)
\left(a^\top(X^\top X)^{-1}X_{gj}\right)
\hat u_{gi}\hat u_{gj}.
\end{align*}
The terms with $i=j$ are the row-level residual-square terms, while the terms with $i\ne j$ are within-state residual cross-products. Thus positive within-state residual correlation adds positive cross-products in the policy direction whenever the corresponding weights have the same sign, increasing
\begin{align*}
a^\top\widehat{\operatorname{Var}}_{\mathrm{CR}}(\hat\beta)a
\end{align*}
relative to a row-level robust calculation that keeps only individual residual-square contributions.
Therefore the clustered denominator
\begin{align*}
\sqrt{a^\top\widehat{\operatorname{Var}}_{\mathrm{CR}}(\hat\beta)a}
\end{align*}
may be much larger than the HC1 denominator. A policy coefficient that is large relative to the row-level robust standard error may no longer be large relative to the state-clustered standard error, because the clustered calculation treats the independent score variation as coming from the $G$ state sums rather than from all counties separately.
[/example]
The chapter's message is that robust inference is a covariance problem, not a new estimator for the regression coefficients. Random-design asymptotics give the normal approximation, White's sandwich estimator handles unknown heteroskedasticity, and cluster-robust covariance estimators handle arbitrary dependence inside many independent groups.
Large-sample theory extends the regression toolkit beyond exact Gaussian calculations by replacing distributional assumptions with consistency and asymptotic normality. With that inferential machinery in place, we can now shift from theory to modelling practice and ask how specification choices affect interpretation.
# 7. Model Building and Interpretation
This chapter turns the regression theory developed so far into modelling practice. Earlier chapters treated least squares and inference as mathematical procedures; here the emphasis is on how choices of scale, controls, transformations, and selection criteria change the question answered by the fitted model. The guiding point is that a regression coefficient is not an intrinsic property of a variable. It is a property of a model, a unit of measurement, and an interpretation regime.
## Scaling, Centering, and Transformed Variables
How should a fitted coefficient be read when the same variable can be measured in dollars, thousands of dollars, logarithms, or deviations from its mean? The numerical value of a coefficient depends on coding choices, while fitted values and substantive conclusions may or may not change. Scaling and centering are therefore not cosmetic operations: they decide which comparisons are easy to express and which interactions have stable interpretations.
[definition: Centered Covariate]
Let $1_n=(1,\dots,1)\in\mathbb R^n$. The centering transformation is the map $T_c:\mathbb R^n\to\mathbb R^n$ defined by
\begin{align*}
T_c(x)=x-\bar x\,1_n, \qquad \bar x=\frac{1}{n}\sum_{i=1}^n x_i.
\end{align*}
For a real-valued covariate $X$ with sample vector $x=(X_1,\dots,X_n)$, the centered covariate $X^c$ has sample vector $T_c(x)$, so
\begin{align*}
X_i^c = X_i - \bar{X}, \qquad \bar{X} = \frac{1}{n}\sum_{i=1}^n X_i.
\end{align*}
[/definition]
Centering changes the meaning of the intercept: the intercept becomes the fitted response at the sample-average value of the centered covariate. For models with interactions, centering also makes main effects correspond to slopes at average levels of the interacting variables.
Sometimes the goal is not only to move the zero point but also to express covariates in comparable units. Standardization combines centering with division by the sample standard deviation, so a one-unit change means a one-standard-deviation change in the original covariate.
[definition: Standardized Covariate]
Let
\begin{align*}
\mathcal A_n=\{x\in\mathbb R^n: \sum_{i=1}^n (x_i-\bar x)^2>0\}.
\end{align*}
The standardization transformation is the map $T_s:\mathcal A_n\to\mathbb R^n$ defined by
\begin{align*}
T_s(x)=\frac{x-\bar x\,1_n}{s_x}, \qquad s_x=\left(\frac{1}{n-1}\sum_{i=1}^n (x_i-\bar x)^2\right)^{1/2}.
\end{align*}
For a real-valued covariate $X$ with sample vector $x=(X_1,\dots,X_n)\in\mathcal A_n$, the standardized covariate $X^s$ has observations
\begin{align*}
X_i^s = \frac{X_i-\bar{X}}{s_X}.
\end{align*}
[/definition]
A standardized coefficient measures the fitted change in the response associated with a one sample-standard-deviation change in the covariate, holding the other included regressors fixed. If the response is standardized as well, the coefficient is measured in response standard deviations per covariate standard deviation.
[example: Standardizing Education in a Wage Regression]
Consider the fitted wage equation, with any included controls collected in $Z_i$,
\begin{align*}
\hat Y_i=\hat\alpha+2.40E_i+Z_i^\top\hat\gamma .
\end{align*}
The coefficient $2.40$ means that, holding $Z_i$ fixed, replacing $E_i$ by $E_i+1$ changes the fitted wage by
\begin{align*}
\left(\hat\alpha+2.40(E_i+1)+Z_i^\top\hat\gamma\right)
-\left(\hat\alpha+2.40E_i+Z_i^\top\hat\gamma\right)
=2.40,
\end{align*}
so the fitted difference is $2.40$ dollars per hour for one additional year of education.
Now replace education by the standardized covariate
\begin{align*}
E_i^s=\frac{E_i-\bar E}{2.5}.
\end{align*}
Solving for $E_i$ gives
\begin{align*}
2.5E_i^s=E_i-\bar E,
\qquad
E_i=\bar E+2.5E_i^s.
\end{align*}
Substituting this into the fitted equation yields
\begin{align*}
\hat Y_i
&=\hat\alpha+2.40(\bar E+2.5E_i^s)+Z_i^\top\hat\gamma\\
&=\hat\alpha+2.40\bar E+(2.40)(2.5)E_i^s+Z_i^\top\hat\gamma\\
&=(\hat\alpha+2.40\bar E)+6.00E_i^s+Z_i^\top\hat\gamma.
\end{align*}
Thus the standardized coefficient is $6.00$, while the adjusted intercept is $\hat\alpha+2.40\bar E$ and the fitted value $\hat Y_i$ is the same number as before for every observation. The unit of comparison has changed: a one-unit increase in $E_i^s$ is a $2.5$-year increase in education, so standardizing exposes the scale of the coefficient without changing the fitted projection.
[/example]
The preceding example suggests a general algebraic principle. Rescaling a covariate changes coordinates in the design matrix, so coefficient uniqueness must be separated from fitted-value uniqueness.
A general statement is useful because the same issue appears for any nonzero affine rescaling, not just for numerical standardization in one example. The result below identifies exactly which coefficient and intercept changes preserve all fitted values.
[quotetheorem:4452]
[citeproof:4452]
This result explains why standardized coefficients can be useful for scale comparison but do not create a new model fit. The full-rank hypothesis is needed because the theorem compares named coefficients, not only fitted values: if $X$ is an exact linear combination of the intercept and $Z$, the projection is still well-defined but the coefficient assigned to $X$ is not unique. The condition $b\ne0$ is also essential, since collapsing a covariate to a constant destroys the information needed to recover the original slope. The theorem does not say that all transformations preserve interpretation; nonlinear transformations such as logarithms change the substantive comparison, so they require their own interpretation rules.
[definition: Log-Level Regression]
A log-level regression is a linear regression in which the response is $\log Y$ and at least one regressor enters in levels:
\begin{align*}
\log Y_i = \beta_0 + \beta_1 X_i + u_i.
\end{align*}
[/definition]
This form is useful when the response is positive and proportional changes in the response are more meaningful than level changes.
A different interpretation is needed when the covariate itself is also measured proportionally. If both variables are positive and the scientific question compares percentage changes in inputs with percentage changes in outcomes, a level-unit slope is no longer the natural summary. Logging both sides creates the specification whose coefficient is read as an elasticity rather than a semi-elasticity.
[definition: Log-Log Regression]
A log-log regression is a linear regression in which both the response and a positive regressor enter through logarithms:
\begin{align*}
\log Y_i = \beta_0 + \beta_1 \log X_i + u_i.
\end{align*}
[/definition]
The log-level coefficient is approximately a proportional change in $Y$ for a one-unit change in $X$. The log-log coefficient is an elasticity: it is the fitted proportional change in $Y$ associated with a proportional change in $X$.
[example: Log-Level and Log-Log Interpretations]
In the log-level model $\log W_i=\beta_0+0.08E_i+u_i$, compare two fitted log wages that differ by one year of education while all other terms are held fixed. The fitted log-wage difference is
\begin{align*}
(\beta_0+0.08(E_i+1))-(\beta_0+0.08E_i)
&=\beta_0+0.08E_i+0.08-\beta_0-0.08E_i\\
&=0.08.
\end{align*}
Exponentiating turns a difference in fitted log wages into a ratio of fitted wage levels:
\begin{align*}
\frac{\exp(\beta_0+0.08(E_i+1))}{\exp(\beta_0+0.08E_i)}
&=\exp\left((\beta_0+0.08(E_i+1))-(\beta_0+0.08E_i)\right)\\
&=e^{0.08}.
\end{align*}
Using $e^{0.08}\approx 1.0833$, the fitted wage factor is about $1.0833$, so the fitted percent increase is
\begin{align*}
100(e^{0.08}-1)\approx 100(1.0833-1)=8.33\%.
\end{align*}
The common $8\%$ statement uses the first-order approximation $e^c-1\approx c$ when $c$ is small.
In the log-log model $\log W_i=\beta_0+0.35\log H_i+u_i$, a $1\%$ increase in hours replaces $H_i$ by $1.01H_i$. The fitted log-wage difference is
\begin{align*}
(\beta_0+0.35\log(1.01H_i))-(\beta_0+0.35\log H_i)
&=0.35(\log(1.01H_i)-\log H_i)\\
&=0.35(\log 1.01+\log H_i-\log H_i)\\
&=0.35\log 1.01.
\end{align*}
Thus the fitted wage ratio is
\begin{align*}
\exp(0.35\log 1.01)=(1.01)^{0.35},
\end{align*}
and the exact fitted percent change is
\begin{align*}
100\left((1.01)^{0.35}-1\right)\approx 100(1.00349-1)=0.349\%.
\end{align*}
So the coefficient $0.35$ is an elasticity: a small proportional increase in hours is associated with about $0.35$ times that proportional increase in fitted wage, conditional on the model covariates.
[/example]
Logarithmic transformations change the scale on which slopes are constant. Interactions address a different modelling issue: the possibility that the slope of one covariate is not constant across levels of another covariate.
To represent that kind of effect modification in an ordinary linear regression, the design matrix must include a new regressor built from the variables whose joint role matters. This motivates the formal definition of an interaction term.
[definition: Interaction Term]
Given two covariates $X$ and $Z$, an interaction term is the product covariate $XZ$ included together with specified lower-order terms in a regression model.
[/definition]
Interaction terms allow the slope with respect to one covariate to depend on the level of another. In the model
\begin{align*}
Y_i = \beta_0+\beta_1X_i+\beta_2Z_i+\beta_3X_iZ_i+u_i,
\end{align*}
the fitted change in $Y$ for a one-unit change in $X$ at fixed $Z=z$ is $\beta_1+\beta_3z$. For this reason, the coefficient $\beta_1$ is not the overall effect of $X$; it is the slope in $X$ when $Z=0$.
[example: Centering an Interaction]
Suppose the fitted model is
\begin{align*}
\hat Y_i=\hat\beta_0+\hat\beta_EE_i+\hat\beta_AA_i+\hat\beta_{EA}E_iA_i.
\end{align*}
At a fixed experience level $A_i=a$, increasing education from $E_i=e$ to $E_i=e+1$ changes the fitted value by
\begin{align*}
&\left(\hat\beta_0+\hat\beta_E(e+1)+\hat\beta_Aa+\hat\beta_{EA}(e+1)a\right)
-\left(\hat\beta_0+\hat\beta_Ee+\hat\beta_Aa+\hat\beta_{EA}ea\right)\\
&=\hat\beta_Ee+\hat\beta_E+\hat\beta_{EA}ea+\hat\beta_{EA}a-\hat\beta_Ee-\hat\beta_{EA}ea\\
&=\hat\beta_E+\hat\beta_{EA}a.
\end{align*}
Thus the education slope at experience level $a$ is $\hat\beta_E+\hat\beta_{EA}a$.
Now center experience by writing
\begin{align*}
A_i^c=A_i-\bar A,
\qquad
A_i=A_i^c+\bar A.
\end{align*}
Substituting $A_i=A_i^c+\bar A$ into the fitted model gives
\begin{align*}
\hat Y_i
&=\hat\beta_0+\hat\beta_EE_i+\hat\beta_A(A_i^c+\bar A)
+\hat\beta_{EA}E_i(A_i^c+\bar A)\\
&=\hat\beta_0+\hat\beta_EE_i+\hat\beta_AA_i^c+\hat\beta_A\bar A
+\hat\beta_{EA}E_iA_i^c+\hat\beta_{EA}\bar A E_i\\
&=(\hat\beta_0+\hat\beta_A\bar A)
+(\hat\beta_E+\hat\beta_{EA}\bar A)E_i
+\hat\beta_AA_i^c+\hat\beta_{EA}E_iA_i^c.
\end{align*}
So in the centered specification, the coefficient on $E_i$ is $\hat\beta_E+\hat\beta_{EA}\bar A$, which is exactly the original education slope evaluated at $A_i=\bar A$. The interaction coefficient remains $\hat\beta_{EA}$, while the education main effect now reports the education comparison at average experience rather than at zero experience.
[/example]
## Causal and Predictive Interpretation
When does a coefficient describe the effect of changing a variable, and when does it only improve prediction? A predictive model asks for accurate fitted or future values. A causal model asks what would happen under an intervention. The same least squares computation can be used in both settings, but the interpretation requires different assumptions.
[definition: Predictive Interpretation]
A regression coefficient has a predictive interpretation when it is read as part of a rule for approximating $\mathbb E[Y\mid X]$ or for predicting new responses from observed covariates.
[/definition]
This interpretation is judged by prediction error rather than by whether changing a covariate would change the outcome. Causal language requires an additional target, namely the comparison between outcomes under different interventions.
[definition: Causal Interpretation]
A regression coefficient has a causal interpretation for a treatment variable $D$ when it is read as the change in the mean potential outcome caused by changing $D$, under assumptions connecting the observed regression to that intervention.
[/definition]
Prediction tolerates variables that are consequences, proxies, or correlates of the target, provided they are available at prediction time and improve out-of-sample accuracy. Causal interpretation is stricter: controlling for the wrong variable can remove part of the causal effect or introduce selection bias.
The first adjustment problem is omitted common causes: variables that make the treatment and outcome move together even when the treatment itself is not the whole explanation. Naming these variables clarifies why some controls are needed for causal interpretation.
[definition: Confounder]
A confounder for the relationship between a treatment $D$ and an outcome $Y$ is a variable $C$ associated with both $D$ and $Y$ whose omission changes the association between $D$ and $Y$ away from the desired causal comparison.
[/definition]
Not every variable associated with both treatment and outcome should be controlled for. The timing and causal role of the variable determine whether adjustment clarifies or distorts the comparison.
The complementary danger is over-adjustment. A regression can look more complete after adding a variable, while the added regressor actually blocks part of the treatment effect or conditions on a selection path. To separate useful adjustment variables from harmful ones, we need a name for controls whose inclusion changes the causal comparison itself.
[definition: Bad Control]
A bad control is a regressor included in a causal regression that is affected by the treatment, or is otherwise on a path whose conditioning distorts the causal comparison of interest.
[/definition]
Controls should be chosen by the causal question, not by the mechanical desire to maximize $R^2$. Pre-treatment common causes are natural candidates for adjustment; post-treatment variables usually require special justification.
[quotetheorem:4453]
[citeproof:4453]
The formula separates two ingredients: relevance of the omitted variable for the outcome, measured by $\beta_2$, and imbalance of the omitted variable across values of $X$, measured by $\pi_1$. If either ingredient is absent, omission does not change the population slope on $X$. The conditional mean assumptions are necessary for interpreting the displayed coefficient as the population regression coefficient generated by this decomposition; if $u$ is still related to $X$ after conditioning on $Z$, the remaining error contributes additional bias not captured by $\beta_2\pi_1$. A simple failure case is an omitted variable $Z$ that is uncorrelated with $X$ but a remaining unobserved factor $W$ that affects both $X$ and $Y$; the formula gives no bias from omitting $Z$, while the regression of $Y$ on $X$ is still not causal. Thus the formula is an accounting identity for one omitted regressor under a linear projection structure, not a complete test for causal validity.
[example: Ability Bias in an Education Regression]
Let $Y_i$ be log wage, $X_i$ years of education, and $Z_i$ an unobserved measure of ability. Suppose the population wage equation and the population linear projection of ability on education are
\begin{align*}
Y_i &= \beta_0+\beta_1X_i+\beta_2Z_i+u_i,\\
Z_i &= \pi_0+\pi_1X_i+v_i.
\end{align*}
Substituting the second equation into the first gives
\begin{align*}
Y_i
&=\beta_0+\beta_1X_i+\beta_2(\pi_0+\pi_1X_i+v_i)+u_i\\
&=\beta_0+\beta_1X_i+\beta_2\pi_0+\beta_2\pi_1X_i+\beta_2v_i+u_i\\
&=(\beta_0+\beta_2\pi_0)+(\beta_1+\beta_2\pi_1)X_i+(\beta_2v_i+u_i).
\end{align*}
Thus, by the *Omitted Variable Bias Formula*, the population coefficient from regressing log wage on education alone is
\begin{align*}
\tilde\beta_1=\beta_1+\beta_2\pi_1.
\end{align*}
The bias relative to the controlled coefficient $\beta_1$ is therefore
\begin{align*}
\tilde\beta_1-\beta_1
&=(\beta_1+\beta_2\pi_1)-\beta_1\\
&=\beta_2\pi_1.
\end{align*}
If ability raises wages, then $\beta_2>0$; if higher-ability individuals tend to obtain more education, then $\pi_1>0$. Multiplying the two positive factors gives
\begin{align*}
\beta_2\pi_1>0,
\end{align*}
so $\tilde\beta_1-\beta_1>0$: the education-only regression overstates the coefficient on education. Credible pre-treatment controls such as family background or prior test scores may reduce this source of bias by accounting for ability-related differences before education is chosen, while adding occupation after education may be a bad control if occupation lies partly on the pathway from education to wages.
[/example]
The phrase "holding controls fixed" can be misleading if it sounds like physically freezing covariates. Algebraically, the issue is how to isolate the part of $X$ that is not already explained by the control matrix $Z$, and the part of $Y$ left after the same controls are removed. The result below states the exact residualization operation that makes this interpretation valid for least squares coefficients.
[quotetheorem:4454]
[citeproof:4454]
The theorem says that controlling for $Z$ can be understood as first removing from both $Y$ and $X$ the linear information explained by $Z$, then regressing the residualized outcome on the residualized covariates of interest. Full column rank is needed because otherwise the residualized columns $M_ZX$ may be linearly dependent, so the coefficient vector on $X$ is not uniquely determined even though the fitted values are. For example, if one column of $X$ is already in the span of $Z$, residualizing makes that column zero and there is no separate slope left to estimate for it. The theorem is algebraic rather than causal: residualizing a bad control still computes the corresponding least squares coefficient, but it does not make the resulting comparison a valid intervention effect. This residualization view prepares the model-selection discussion, where adding variables changes both predictive fit and the meaning of coefficients.
[example: Partialling Out Controls in a Wage Equation]
Let $Y_i=\log W_i$, let $X_i$ be years of education, and let the rows of $Z$ contain the intercept, age, region indicators, and gender indicators. Write $Y=(Y_1,\dots,Y_n)^\top$, $X=(X_1,\dots,X_n)^\top$, and let
\begin{align*}
M_Z=I-Z(Z^\top Z)^{-1}Z^\top
\end{align*}
be the residual-maker matrix for the controls. The residuals from regressing log wage on the controls are
\begin{align*}
r^Y=M_ZY,
\end{align*}
and the residuals from regressing education on the same controls are
\begin{align*}
r^X=M_ZX.
\end{align*}
The slope from regressing $r^Y$ on $r^X$ is
\begin{align*}
\hat b
&=\frac{(r^X)^\top r^Y}{(r^X)^\top r^X}\\
&=\frac{(M_ZX)^\top(M_ZY)}{(M_ZX)^\top(M_ZX)}\\
&=\frac{X^\top M_Z^\top M_ZY}{X^\top M_Z^\top M_ZX}.
\end{align*}
Since $M_Z^\top=M_Z$ and $M_Z^2=M_Z$, this becomes
\begin{align*}
\hat b
&=\frac{X^\top M_ZY}{X^\top M_ZX}\\
&=(X^\top M_ZX)^{-1}X^\top M_ZY,
\end{align*}
where the last line uses that $X^\top M_ZX$ is a nonzero scalar when education has variation left after removing the controls. By the *Frisch-Waugh-Lovell Theorem*, this is exactly the coefficient on education in the full regression of log wage on education and all controls. Thus the education coefficient measures the relationship between the part of log wage not linearly explained by the controls and the part of education not linearly explained by those same controls.
[/example]
## Model Selection Criteria and Prediction Error
How should we choose among competing models when adding variables almost always improves in-sample fit? The central difficulty is that training error rewards flexibility even when the added terms are fitting noise. Model selection criteria and validation methods try to estimate predictive performance while penalizing unnecessary complexity.
[definition: Adjusted Coefficient of Determination]
For a least squares regression with $n$ observations, $p$ fitted coefficients including the intercept, residual sum of squares $RSS$, and total sum of squares $TSS$, the adjusted coefficient of determination is
\begin{align*}
\bar R^2 = 1-\frac{RSS/(n-p)}{TSS/(n-1)}.
\end{align*}
[/definition]
Unlike ordinary $R^2$, adjusted $R^2$ can decrease when a regressor is added. It rewards fit only when the reduction in residual sum of squares is large enough relative to the lost degree of freedom.
For likelihood-based models, sums of squares may not be the common currency for comparing specifications. The same overfitting problem appears on the likelihood scale: maximizing likelihood alone favors extra parameters. A criterion is therefore needed that keeps the likelihood comparison but charges a penalty for model dimension.
[definition: Akaike Information Criterion]
For a fitted parametric model with maximized likelihood $\hat L$ and $p$ fitted parameters, the Akaike information criterion is
\begin{align*}
AIC = -2\log \hat L + 2p.
\end{align*}
[/definition]
AIC uses a fixed penalty of two units per parameter, so it is relatively willing to keep variables when they improve likelihood.
A fixed penalty can be too permissive when the sample size is large enough for small likelihood gains to accumulate across many candidate variables. One response is to make the complexity charge grow with $n$, so that the criterion demands stronger evidence before retaining additional parameters in larger samples.
[definition: Bayesian Information Criterion]
For a fitted parametric model with maximized likelihood $\hat L$, $p$ fitted parameters, and sample size $n$, the Bayesian information criterion is
\begin{align*}
BIC = -2\log \hat L + p\log n.
\end{align*}
[/definition]
AIC and BIC have the same goodness-of-fit term but different complexity penalties. BIC penalizes parameters more heavily than AIC once $n>e^2$, so it often selects smaller models in moderate or large samples.
Information criteria estimate predictive adequacy indirectly through fitted likelihood and a penalty formula. When prediction is the main goal, we need a measurement that asks how the fitted rule performs away from the data that chose its coefficients. The basic object is therefore the error on an independent held-out sample, where overfitting cannot be hidden by reusing the training responses.
[definition: Train-Test Error]
Given a training sample used to fit a prediction rule $\hat f$ and an independent test sample $(X_i^{\mathrm{test}},Y_i^{\mathrm{test}})_{i=1}^N$, the test mean squared error is
\begin{align*}
\frac{1}{N}\sum_{i=1}^N \left(Y_i^{\mathrm{test}}-\hat f(X_i^{\mathrm{test}})\right)^2.
\end{align*}
[/definition]
A single train-test split is conceptually clean but can be noisy when the dataset is small. The measured test error may depend heavily on which observations happened to be held out, and reserving a large test set leaves less data for fitting.
The obstruction is that assessment needs independence from the fitted rule, while efficient fitting wants to use as many observations as possible. $K$-fold cross-validation resolves this tension by rotating the held-out role across several folds, so each observation can be tested against a model that did not use it while still letting most of the data contribute to each fit.
[definition: K-Fold Cross-Validation]
$K$-fold cross-validation partitions the data into $K$ disjoint folds, fits the model $K$ times leaving out one fold each time, and averages the prediction loss on the held-out folds.
[/definition]
Cross-validation estimates the error of the whole modelling procedure, including transformations, selected degree, and tuning choices, provided those choices are repeated inside each training fold. If feature selection is done before cross-validation on the full data, the validation error is usually too optimistic.
To understand why validation error has to be measured on data not used for fitting, it helps to decompose prediction risk at a fresh observation. The key obstruction is that the same flexibility that reduces training error can make the fitted rule sensitive to the particular sample, creating variance that is invisible from the training residuals alone.
[quotetheorem:4455]
[citeproof:4455]
This decomposition explains why the best predictive model is not necessarily the largest available model. More flexible models may reduce bias but increase variance; simpler models may have more bias but less sensitivity to the training sample. The independence of the test observation matters because training error reuses the same noise that helped fit $\hat f$, so it generally underestimates prediction error. The conditional mean-zero assumption separates signal from noise; if $\mathbb E[\varepsilon_0\mid X_0]\ne0$, then part of what is labelled irreducible noise should instead be absorbed into the regression function. The constant conditional variance assumption is what turns the noise contribution into the single number $\sigma^2$; with heteroskedastic noise it becomes $\mathbb E[\operatorname{Var}(\varepsilon_0\mid X_0)]$. Thus the decomposition is a guide to the bias-variance trade-off, not a guarantee that a particular validation protocol has been carried out correctly.
[example: Selecting Polynomial Degree by Cross-Validation]
Suppose the data are partitioned into $K$ disjoint folds $I_1,\dots,I_K$, and the candidate prediction rules are polynomial regressions of degrees $d=1,2,\dots,10$. For a fixed fold $k$ and degree $d$, fit the degree-$d$ model using only observations outside $I_k$:
\begin{align*}
\hat f_{d}^{(-k)}(x)
=\hat\alpha_{0,d}^{(-k)}+\hat\alpha_{1,d}^{(-k)}x+\hat\alpha_{2,d}^{(-k)}x^2+\cdots+\hat\alpha_{d,d}^{(-k)}x^d.
\end{align*}
The held-out squared-error contribution from fold $I_k$ is
\begin{align*}
CV_{k}(d)
=\frac{1}{|I_k|}\sum_{i\in I_k}\left(Y_i-\hat f_{d}^{(-k)}(X_i)\right)^2.
\end{align*}
Averaging across the $K$ folds gives the cross-validation score
\begin{align*}
CV(d)
&=\frac{1}{K}\sum_{k=1}^K CV_k(d)\\
&=\frac{1}{K}\sum_{k=1}^K
\frac{1}{|I_k|}\sum_{i\in I_k}\left(Y_i-\hat f_{d}^{(-k)}(X_i)\right)^2.
\end{align*}
The selected degree is therefore
\begin{align*}
\hat d \in \operatorname*{arg\,min}_{d\in\{1,\dots,10\}} CV(d).
\end{align*}
If, for instance,
\begin{align*}
CV(1)>CV(2)>CV(3),\qquad
CV(3)<CV(4)<\cdots<CV(10),
\end{align*}
then the minimizing degree is $\hat d=3$. In that case the linear and quadratic fits have larger held-out error than the cubic fit, while degrees above $3$ also have larger held-out error, so the cross-validation rule selects the intermediate degree rather than the simplest or most flexible candidate.
[/example]
The same selection step that improves estimated prediction error can complicate inference. Once the data have chosen a model, the usual standard errors no longer describe a procedure with a fixed design chosen in advance.
[remark: Selection Changes Inference]
Standard confidence intervals from the final selected regression treat the selected model as fixed in advance. If the same data were used to choose variables, transform covariates, and report inference, those intervals usually understate uncertainty. Valid post-selection inference requires either sample splitting, adjusted procedures, or a modelling protocol specified before looking at the outcome.
[/remark]
The chapter's main message is that regression modelling is an exercise in translating a statistical question into a design matrix and then translating the fitted coefficients back into the original question. Scaling and transformations alter units, interactions alter marginal meanings, controls alter the comparison group, and selection rules alter the target of reported performance. The algebra of least squares remains the same, but interpretation is earned by the modelling assumptions surrounding it.
Model-building choices change the meaning of a fitted regression even when the least squares algebra stays the same. The fitted model then becomes a candidate explanation that must be checked against the data, which is why diagnostics and influence analysis come next.
# 8. Diagnostics and Influence
This chapter asks what can go wrong after a linear model has been fitted. Chapters 2--7 developed estimation, inference, and model-building choices under assumptions about the mean function, error variance, and design matrix. Diagnostics turn the fitted model back on the data: they ask whether residuals behave like unexplained noise, whether individual observations dominate the fit, and whether the design itself makes coefficient estimates unstable.
The central theme is that regression output is not just a table of coefficients. A fitted linear model also produces residuals, leverages, deleted fits, and matrix summaries that reveal where the model is strained. These tools do not replace subject-matter judgement, but they give a disciplined way to locate nonlinearity, heteroskedasticity, outliers, influential points, and multicollinearity.
## Residual Plots and Model Failure
After fitting a regression, the first question is whether the errors left behind resemble unstructured noise. If a systematic pattern remains in the residuals, then a coefficient table can give precise answers to the wrong model.
Let $y \in \mathbb R^n$, let $X \in \mathbb R^{n \times p}$ have full column rank, and let
\begin{align*}
\hat{\beta} = (X^\top X)^{-1}X^\top y, \qquad \hat{y}=X\hat{\beta}, \qquad e=y-\hat{y}.
\end{align*}
Write $H=X(X^\top X)^{-1}X^\top$ for the hat matrix, so that $\hat{y}=Hy$ and $e=(I-H)y$. Since $H$ is an orthogonal projection onto $\operatorname{Range}(X)$, residuals are orthogonal to every fitted linear combination of the predictors.
[definition: Residual]
In the linear regression fit of $y \in \mathbb R^n$ on a full-rank design matrix $X \in \mathbb R^{n \times p}$, the residual vector is
\begin{align*}
e = y - X\hat{\beta}, \qquad \hat{\beta}=(X^\top X)^{-1}X^\top y.
\end{align*}
[/definition]
The $i$th residual is $e_i=y_i-\hat{y}_i$.
Residual plots are most useful when each plot is tied to a concrete failure mode. Plotting $e_i$ against the fitted value $\hat{y}_i$ checks whether the conditional mean is adequately linear. Plotting $|e_i|$ or $e_i^2$ against $\hat{y}_i$ or against a predictor checks whether the conditional variance changes across the design. Plotting residuals against time, spatial location, or collection order checks whether the independence assumption has been violated by dependence in the data-generating process.
[example: Anscombe Quartet]
Anscombe's quartet can be written as four data sets of size $n=11$. For the first three data sets the $x$-values are
\begin{align*}
10,8,13,9,11,14,6,4,12,7,5,
\end{align*}
while the fourth data set has
\begin{align*}
8,8,8,8,8,8,8,19,8,8,8.
\end{align*}
In each case, using the reported rounded values,
\begin{align*}
\bar{x}&=\frac{99}{11}=9,\\
\bar{y}&=\frac{82.51}{11}=7.5009\ldots,\\
S_{xx}&=\sum_{i=1}^{11}(x_i-\bar{x})^2=110,\\
S_{xy}&=\sum_{i=1}^{11}(x_i-\bar{x})(y_i-\bar{y})\approx 55.0.
\end{align*}
Therefore the least-squares slope and intercept are
\begin{align*}
\hat{\beta}_1
&=\frac{S_{xy}}{S_{xx}}
\approx \frac{55.0}{110}
\approx 0.50,\\
\hat{\beta}_0
&=\bar{y}-\hat{\beta}_1\bar{x}
\approx 7.5009-(0.50)(9)
\approx 3.00.
\end{align*}
So all four data sets give essentially the same fitted line,
\begin{align*}
\hat{y}\approx 3.00+0.50x.
\end{align*}
The residual sum of squares is also essentially the same. For the first data set, using $\hat{y}=3+0.5x$, the fitted values are
\begin{align*}
8,7,9.5,7.5,8.5,10,6,5,9,6.5,5.5,
\end{align*}
so the residuals are
\begin{align*}
0.04,-0.05,-1.92,1.31,-0.17,-0.04,1.24,-0.74,1.84,-1.68,0.18.
\end{align*}
Thus
\begin{align*}
\operatorname{RSS}
&=0.04^2+(-0.05)^2+(-1.92)^2+1.31^2+(-0.17)^2+(-0.04)^2\\
&\qquad +1.24^2+(-0.74)^2+1.84^2+(-1.68)^2+0.18^2\\
&=13.7627.
\end{align*}
Repeating the same subtraction and squaring for the other three rounded data sets gives residual sums of squares near $13.75$ as well.
The shared numerical summaries hide four different geometries. In the first data set the residuals fluctuate around the line without a strong pattern. In the second, the residuals are positive near the middle and negative toward the ends, showing curvature. In the third, one large vertical residual contributes most of the error scale. In the fourth, ten observations have $x=8$ and one observation has $x=19$, so the apparent positive slope is created by a single high-leverage point. Thus the coefficient table and residual sum of squares do not determine whether the fitted line is appropriate; the scatter plot and residual plot contain information that the numerical summaries suppress.
[/example]
A residual plot with a curved band indicates mean misspecification: the model is trying to fit a nonlinear conditional mean with a linear surface. A funnel-shaped residual plot points to a different failure mode: the average relationship may be adequate, while the conditional spread of the errors changes across the covariate space. This variance pattern needs its own name because it affects standard errors even when the fitted mean remains meaningful.
[definition: Heteroskedasticity]
In a regression model with errors $\varepsilon_i$ and design rows $x_i^\top$, heteroskedasticity means that
\begin{align*}
\operatorname{Var}(\varepsilon_i \mid X) = \sigma_i^2
\end{align*}
where the values $\sigma_1^2,\dots,\sigma_n^2$ are not all equal.
[/definition]
Heteroskedasticity does not by itself make the OLS fitted values meaningless. The main inferential issue is that the usual standard error formula based on a common variance $\sigma^2$ no longer estimates the right variance for $\hat{\beta}$.
[remark: Diagnostics Are Conditional]
A residual plot is interpreted relative to the fitted model and the chosen scale. A nonlinear pattern may disappear after adding a transformation or interaction, and a variance pattern may become weaker after modelling the response on a logarithmic scale. Diagnostics therefore guide model revision rather than delivering a mechanical accept-or-reject verdict.
[/remark]
## Leverage and Deleted Residuals
A point can be unusual in its response value, in its predictor values, or in both. The next question is how to separate a large vertical residual from an observation whose covariate vector gives it disproportionate control over the fitted surface.
[definition: Leverage]
For a full-rank design matrix $X \in \mathbb R^{n \times p}$ with hat matrix $H=X(X^\top X)^{-1}X^\top$, the leverage of observation $i$ is
\begin{align*}
h_{ii}=H_{ii}=x_i^\top (X^\top X)^{-1}x_i.
\end{align*}
[/definition]
The leverage values satisfy $0\le h_{ii}\le 1$ and $\sum_{i=1}^n h_{ii}=p$. Thus the average leverage is $p/n$, so values much larger than $p/n$ indicate observations far from the centre of the design in the metric induced by $(X^\top X)^{-1}$.
These facts should not be treated as separate algebraic coincidences. To use leverage as a diagnostic, we need the structural reason that every diagonal entry is bounded and that the total leverage equals the model dimension. That reason is the projection geometry of the hat matrix, which the next result isolates.
[quotetheorem:4456]
[citeproof:4456]
The full-column-rank hypothesis is what makes $X^\top X$ invertible and hence makes the hat matrix in this form well-defined. If two columns of $X$ are identical, for example, then $X^\top X$ is singular and the formula $X(X^\top X)^{-1}X^\top$ cannot be used without replacing the inverse by a generalized inverse and revisiting the interpretation of $p$. These projection identities also do not say by themselves whether an observation is influential: leverage measures potential influence through the design alone, not whether the response value pulls the fit. To measure actual influence, we also need residual size. A high-leverage observation sitting exactly on the fitted line may have small residual but still determine the location of the line; a low-leverage observation with a large residual may be an outlier without moving the fitted model much. This is why leverage enters the deletion formulas next, where it controls how much the fitted value at an observation changes when that observation is removed.
The leave-one-out fit removes observation $i$ and refits the model. Denote the resulting estimator, fitted value at $x_i$, and residual by
\begin{align*}
\hat{\beta}_{(i)}, \qquad \hat{y}_{i(i)}=x_i^\top \hat{\beta}_{(i)}, \qquad e_{i(i)}=y_i-\hat{y}_{i(i)}.
\end{align*}
These deleted quantities appear to require $n$ separate regressions, but the projection geometry gives closed formulas.
[quotetheorem:4457]
[citeproof:4457]
The condition $1-h_{ii}>0$ is not a technical decoration: if $h_{ii}=1$, then observation $i$ contributes a design direction not supplied by the other rows, and deleting it can make $X_{(i)}^\top X_{(i)}$ singular. For example, with $X=I_n$ every leverage is $1$, the full model interpolates all responses, and deleting one row leaves no information for the corresponding coordinate direction. The formula also has a limited scope: it describes exact least-squares deletion under a fixed full-rank linear design, not robust regression, weighted regression, or a model refitted after changing transformations or selected predictors. Within that scope, it explains why leverage amplifies deletion effects. If $h_{ii}$ is close to $1$, then the fitted value at observation $i$ is strongly tied to $y_i$, so the ordinary residual $e_i$ can understate how far $y_i$ is from the model fitted without that observation. This motivates studentized and deleted residual diagnostics, which rescale residuals by the variance left after accounting for leverage.
[definition: Studentized Residual]
Let $s^2=\operatorname{RSS}/(n-p)$ with $\operatorname{RSS}=\sum_{i=1}^n e_i^2$. The internally studentized residual is
\begin{align*}
r_i = \frac{e_i}{s\sqrt{1-h_{ii}}}.
\end{align*}
If $s_{(i)}^2$ is the residual variance estimate after deleting observation $i$, the externally studentized residual is
\begin{align*}
t_i = \frac{e_i}{s_{(i)}\sqrt{1-h_{ii}}}.
\end{align*}
[/definition]
Studentization puts residuals on a comparable scale. The factor $\sqrt{1-h_{ii}}$ matters because residuals at high-leverage observations have smaller conditional variance under the homoskedastic Gaussian linear model.
[example: High Leverage Point Changing a Regression Line]
Take six observations
\begin{align*}
(-2,0),\;(-1,0),\;(0,0),\;(1,0),\;(2,0),\;(20,20).
\end{align*}
With an intercept in the model, the least-squares slope is
\begin{align*}
\hat{\beta}_1
=\frac{\sum_{i=1}^6 (x_i-\bar{x})(y_i-\bar{y})}{\sum_{i=1}^6 (x_i-\bar{x})^2}.
\end{align*}
Here
\begin{align*}
\bar{x}
&=\frac{-2-1+0+1+2+20}{6}
=\frac{20}{6}
=\frac{10}{3},\\
\bar{y}
&=\frac{0+0+0+0+0+20}{6}
=\frac{10}{3}.
\end{align*}
Using $\sum_i (x_i-\bar{x})^2=\sum_i x_i^2-6\bar{x}^2$ and the analogous centred cross-product identity,
\begin{align*}
\sum_{i=1}^6 (x_i-\bar{x})^2
&=\left(4+1+0+1+4+400\right)-6\left(\frac{10}{3}\right)^2\\
&=410-\frac{600}{9}
=410-\frac{200}{3}
=\frac{1030}{3},\\
\sum_{i=1}^6 (x_i-\bar{x})(y_i-\bar{y})
&=\sum_{i=1}^6 x_i y_i-6\bar{x}\bar{y}\\
&=400-6\left(\frac{10}{3}\right)\left(\frac{10}{3}\right)\\
&=400-\frac{600}{9}
=400-\frac{200}{3}
=\frac{1000}{3}.
\end{align*}
Therefore
\begin{align*}
\hat{\beta}_1
&=\frac{1000/3}{1030/3}
=\frac{1000}{1030}
=\frac{100}{103},\\
\hat{\beta}_0
&=\bar{y}-\hat{\beta}_1\bar{x}
=\frac{10}{3}-\frac{100}{103}\cdot \frac{10}{3}\\
&=\frac{10}{3}\left(1-\frac{100}{103}\right)
=\frac{10}{3}\cdot \frac{3}{103}
=\frac{10}{103}.
\end{align*}
Thus the fitted line with the far-right observation included is
\begin{align*}
\hat{y}=\frac{10}{103}+\frac{100}{103}x.
\end{align*}
At the far-right observation,
\begin{align*}
\hat{y}
&=\frac{10}{103}+\frac{100}{103}\cdot 20
=\frac{2010}{103},\\
e
&=20-\frac{2010}{103}
=\frac{2060-2010}{103}
=\frac{50}{103}
\approx 0.49.
\end{align*}
So the point has pulled the slope close to $1$ while leaving itself with only a small residual.
If the observation $(20,20)$ is deleted, the remaining data are
\begin{align*}
(-2,0),\;(-1,0),\;(0,0),\;(1,0),\;(2,0).
\end{align*}
For these five observations,
\begin{align*}
\bar{x}_{(6)}
&=\frac{-2-1+0+1+2}{5}
=0,\\
\bar{y}_{(6)}
&=0,
\end{align*}
and hence
\begin{align*}
\hat{\beta}_{1(6)}
&=\frac{\sum_{i=1}^5 (x_i-0)(0-0)}{\sum_{i=1}^5 (x_i-0)^2}
=\frac{0}{4+1+0+1+4}
=0,\\
\hat{\beta}_{0(6)}
&=0-0\cdot 0
=0.
\end{align*}
Deleting the far-right point therefore changes the fitted line from nearly $\hat{y}=x$ back to the flat line $\hat{y}=0$. The example shows that influence is not just residual size: the far-right point has unusual predictor value, and its response value supports a very different slope from the one supported by the rest of the data.
[/example]
## Influence Measures
Once a potentially unusual observation has been found, the practical question is how much the fitted regression changes if that observation is removed. Influence diagnostics summarise the change in fitted values or coefficients caused by deletion.
[definition: Cooks Distance]
For observation $i$, Cook's distance is
\begin{align*}
D_i = \frac{\|\hat{y}-\hat{y}_{(i)}\|_2^2}{p s^2},
\end{align*}
where $\hat{y}_{(i)}=X\hat{\beta}_{(i)}$ denotes the vector of fitted values at all original design rows using the fit with observation $i$ deleted.
[/definition]
Cook's distance measures the squared movement of the fitted mean vector, scaled by the estimated noise level and by the number of fitted coefficients. It is invariant under nonsingular linear reparameterisations of the columns of $X$, because it is defined through fitted values rather than through a particular coordinate representation of $\beta$.
For computation and interpretation, the remaining question is how this global deletion measure relates to quantities already visible in the original fit. A useful identity shows that influence is large only when residual size and leverage combine, so neither an unusual response nor an unusual design point is sufficient by itself.
[quotetheorem:4458]
[citeproof:4458]
The hypotheses ensure that the ordinary fit, deletion comparison, and scale estimate are all defined. If $n=p$, then the full-rank least-squares fit interpolates the data, the residual degrees of freedom are zero, and $s^2=\operatorname{RSS}/(n-p)$ is not available; in that case Cook's distance in this form has no meaning. If $X$ loses rank, the fitted values may still be describable through a projection, but the coefficient-based derivation and the value of $p$ used for scaling require a separate convention. The identity displays the two components of influence: a point with large studentized residual but small leverage may look unusual without moving the fitted line much, while a point with large leverage and non-negligible residual can dominate the fitted mean vector. Cook's distance is also limited to measuring change in the fitted mean under deletion; it does not decide whether the observation is erroneous, whether the model class is wrong, or whether a scientifically important extreme design point should be retained. The next diagnostic, DFFITS, narrows the same deletion idea to the fitted value at the observation's own design point.
[definition: DFFITS]
For observation $i$, DFFITS is
\begin{align*}
\operatorname{DFFITS}_i = \frac{\hat{y}_i - \hat{y}_{i(i)}}{s_{(i)}\sqrt{h_{ii}}}.
\end{align*}
[/definition]
DFFITS focuses on the change in the fitted value at the observation's own design point. Cook's distance aggregates the change across all fitted values, while DFFITS asks whether the prediction at $x_i$ itself is mainly supported by the inclusion of the $i$th observation.
[quotetheorem:4459]
[citeproof:4459]
The extra assumptions in the DFFITS identity rule out degenerate cases in which the displayed diagnostic is not defined. If $h_{ii}=0$, then the fitted value at $x_i$ is unaffected by the response direction in the way measured by DFFITS, and the denominator $s_{(i)}\sqrt{h_{ii}}$ vanishes. If $s_{(i)}=0$, the deleted model has no residual scale left, as can occur in an exactly fitting or nearly saturated deletion fit. DFFITS therefore has the same deletion-model limitation as Cook's distance, but it answers a more local question: how much the fitted value at the deleted observation's design point depends on that observation. Influence diagnostics should be followed by substantive investigation rather than automatic deletion. An influential observation might be a recording error, a data point from a different population, or a scientifically important part of the design space where the fitted model is extrapolating.
## Multicollinearity and Unstable Coefficients
Even when no single row dominates the fit, the columns of the design matrix can make coefficient estimates unstable. The question here is whether the predictors contain nearly redundant information, so that small changes in the data produce large changes in individual coefficient estimates.
[definition: Multicollinearity]
In a regression design matrix $X$, multicollinearity means that at least one column of $X$ is well approximated by a linear combination of the other columns. Exact multicollinearity means that the columns of $X$ are linearly dependent.
[/definition]
Exact multicollinearity makes $X^\top X$ singular, so the ordinary least-squares coefficient vector is not unique. Near multicollinearity leaves the estimator defined but makes some directions in coefficient space weakly determined by the data.
To isolate the issue for a single predictor, suppose the model includes an intercept and let $x_j$ be the centred column for predictor $j$. Regress $x_j$ on the remaining predictors and write $R_j^2$ for the coefficient of determination in that auxiliary regression.
[definition: Variance Inflation Factor]
For predictor $j$, the variance inflation factor is
\begin{align*}
\operatorname{VIF}_j = \frac{1}{1-R_j^2},
\end{align*}
where $R_j^2$ is obtained by regressing predictor $j$ on the remaining predictors in the design.
[/definition]
A large value of $\operatorname{VIF}_j$ means that predictor $j$ is nearly explained by the other predictors. The coefficient of $x_j$ is then estimated from the part of $x_j$ orthogonal to the other columns, which may contain little variation.
To see exactly how this loss of independent variation enters uncertainty, we need the variance formula for a single coefficient in terms of the auxiliary regression. The obstruction is geometric: as $x_j$ becomes nearly a linear combination of the other predictors, the denominator controlling the coefficient's precision collapses.
[quotetheorem:4460]
[citeproof:4460]
The centering and intercept assumptions ensure that $S_{jj}$ measures variation around the sample mean and that the auxiliary regression for $R_j^2$ is aligned with the coefficient of predictor $j$ in the original model. Without an intercept, or without centering when an intercept is present, the same formula can mix location effects with collinearity effects and the auxiliary $R_j^2$ no longer isolates the residual variation in the predictor in the same way. The fixed full-rank design assumption is needed because the conditional variance formula uses $(X^\top X)^{-1}$; with exact collinearity, for instance when $x_j$ is the sum of two other included columns, $R_j^2=1$ and the coefficient is not uniquely estimable. Homoskedasticity is also essential for the displayed variance: if the error variances differ by observation, the correct conditional covariance is the sandwich form rather than $\sigma^2(X^\top X)^{-1}$. This theorem separates two reasons a coefficient can be imprecise. If $S_{jj}$ is small, predictor $j$ has little variation in the sample. If $R_j^2$ is close to $1$, predictor $j$ varies but mostly in a direction already explained by other predictors. The condition number below extends this single-predictor diagnosis to global near-dependence among all columns.
[definition: Condition Number]
For a full-rank matrix $X$, the spectral condition number is
\begin{align*}
\kappa(X)=\frac{\sigma_{\max}(X)}{\sigma_{\min}(X)},
\end{align*}
where $\sigma_{\max}(X)$ and $\sigma_{\min}(X)$ are the largest and smallest singular values of $X$.
[/definition]
The condition number measures global near-dependence among columns. Since the eigenvalues of $X^\top X$ are the squared singular values of $X$, a large condition number means that $(X^\top X)^{-1}$ greatly expands noise in some coefficient directions.
[example: Near Collinear Predictors in House Price Data]
In a house-price regression, let $x_1$ be centred and scaled total square footage and let $x_2$ be centred and scaled above-ground living area. If these two predictors have sample correlation $0.98$, then the auxiliary regression of $x_1$ on $x_2$ alone has
\begin{align*}
R_1^2=(0.98)^2=0.9604.
\end{align*}
Therefore the variance inflation factor for the total-square-footage coefficient is
\begin{align*}
\operatorname{VIF}_1
&=\frac{1}{1-R_1^2}\\
&=\frac{1}{1-0.9604}\\
&=\frac{1}{0.0396}\\
&=25.2525\ldots.
\end{align*}
Thus only $3.96\%$ of the variation in total square footage remains after removing the part explained by above-ground living area, and the variance of its coefficient is inflated by a factor of about $25.25$ relative to an orthogonal predictor with the same sample variation.
The instability is about separating effects, not necessarily about prediction. For instance, if two centred predictors are nearly equal, write
\begin{align*}
x_1=z+d,\qquad x_2=z-d,
\end{align*}
where $d$ is small compared with $z$. Two coefficient pairs can then give nearly the same fitted contribution:
\begin{align*}
200x_1-150x_2
&=200(z+d)-150(z-d)\\
&=200z+200d-150z+150d\\
&=50z+350d,
\end{align*}
while
\begin{align*}
50x_1+0x_2
&=50(z+d)\\
&=50z+50d.
\end{align*}
The difference between these fitted contributions is
\begin{align*}
(50z+350d)-(50z+50d)=300d,
\end{align*}
which is small when the two predictors differ only slightly. So a model may predict prices accurately while assigning unstable signs and magnitudes to total square footage, above-ground living area, rooms, and assessed value; replacing redundant predictors by one interpretable size measure can stabilize the coefficients while changing fitted values only modestly.
[/example]
Multicollinearity is therefore mostly a problem for interpretation of individual coefficients, not necessarily for prediction. If future observations have the same collinearity pattern as the training data, fitted values may be stable while coefficients are not. If the model is used for extrapolation or causal interpretation, near-collinearity becomes more serious because the data contain little information about separating the effects of the correlated predictors.
[remark: Scaling Before Condition Numbers]
Condition numbers depend on the units of the columns of $X$. Before using a condition number to diagnose collinearity among predictors, continuous predictors should be centred and scaled in a way that matches the modelling question. Otherwise a large condition number may reflect mixed measurement units rather than a near-linear dependence in the statistical design.
[/remark]
## Diagnostic Workflow
The final question is how these diagnostics fit into an analysis without turning model checking into a list of unrelated thresholds. A useful workflow moves from global structure to individual observations and then to design stability.
First, inspect the response against key predictors and inspect residuals against fitted values, predictors, and collection order. This stage looks for nonlinearity, changing variance, dependence, and gross data errors. Second, examine leverage, studentized residuals, deleted residuals, Cook's distance, and DFFITS to identify observations whose removal would materially change the fitted model. Third, inspect VIFs, condition numbers, and coefficient sensitivity under reasonable changes of predictor set to assess whether individual coefficients are stable.
[example: Combining Residual and Influence Diagnostics]
Consider a regression of energy usage on temperature, building size, and occupancy, with an intercept, so $p=4$ fitted coefficients. Suppose the residuals averaged over three temperature bands are
\begin{align*}
\bar e_{\text{cold}}&=14.2,\\
\bar e_{\text{mild}}&=-9.6,\\
\bar e_{\text{hot}}&=13.1.
\end{align*}
Because the residuals are positive at low temperatures, negative at moderate temperatures, and positive again at high temperatures, the residual plot against temperature has a curved pattern rather than random scatter around $0$. This supports replacing a single linear temperature term by separate heating and cooling terms, for example by using
\begin{align*}
\operatorname{Heat}=(65-\operatorname{Temp})_+,\qquad
\operatorname{Cool}=(\operatorname{Temp}-65)_+.
\end{align*}
Now suppose one very large building has leverage $h_{ii}=0.42$, residual $e_i=18$, residual scale $s=12$, and $p=4$. Its internally studentized residual is
\begin{align*}
r_i
&=\frac{e_i}{s\sqrt{1-h_{ii}}}\\
&=\frac{18}{12\sqrt{1-0.42}}\\
&=\frac{18}{12\sqrt{0.58}}
\approx 1.97.
\end{align*}
Using the Cook's distance identity,
\begin{align*}
D_i
&=\frac{r_i^2}{p}\frac{h_{ii}}{1-h_{ii}}\\
&=\frac{(1.97)^2}{4}\frac{0.42}{0.58}\\
&=\frac{3.8809}{4}\cdot 0.7241\\
&\approx 0.70.
\end{align*}
The residual is not extreme by itself, but the leverage factor $0.42/0.58$ makes deletion of this building capable of moving the fitted regression noticeably; the analyst should compare the fitted model with and without this building and check whether it belongs to the same population of buildings.
Finally, suppose building size and assessed value are both included and the auxiliary regression of size on assessed value and the other predictors has $R^2=0.94$. Then
\begin{align*}
\operatorname{VIF}_{\text{size}}
&=\frac{1}{1-R^2}\\
&=\frac{1}{1-0.94}\\
&=\frac{1}{0.06}\\
&=16.666\ldots.
\end{align*}
Thus only $6\%$ of the variation in building size remains after removing the part explained by the other predictors, and the variance of the size coefficient is inflated by a factor of about $16.7$. The diagnostics point to three different issues: temperature curvature suggests mean misspecification, the extreme building tests sensitivity to deletion, and the high VIF explains unstable individual coefficients even when predictions remain accurate.
[/example]
The common principle is to ask what feature of the data each diagnostic is measuring. Residual plots measure model misfit in the response direction, leverage measures unusual predictor values, influence measures the effect of deletion on the fitted regression, and collinearity diagnostics measure weakly identified coefficient directions. A reliable regression analysis reports the findings that affect interpretation, explains any modelling changes made in response, and distinguishes data errors from genuine but influential observations.
Diagnostics show that not every large residual or extreme leverage point should be treated alike: some reveal data problems, while others identify influential structure in the design. When ordinary least squares is unstable or too variable, the natural response is not only to inspect the fit but to regularize it.
# 9. Beyond OLS: Regularization and High-Dimensional Regression
Regularization enters when ordinary least squares is too variable, too unstable, or no longer uniquely determined. In Chapters 2--6, least squares was supported by projection geometry, finite-sample inference, and large-sample covariance approximations under fixed low-dimensional designs. Here the guiding question changes: when the design has many correlated variables, or even more variables than observations, how should we trade a small amount of bias for a substantial reduction in prediction error?
The chapter studies ridge regression and lasso as two canonical answers. Ridge regression shrinks coefficients continuously and is especially useful under multicollinearity. Lasso combines shrinkage with variable selection, producing sparse fitted models. The final section explains how tuning parameters are chosen in practice and why inference after data-driven selection needs extra care.
## Ridge Regression and Shrinkage Along Principal Components
What breaks down when the columns of the design matrix are nearly linearly dependent? The OLS estimator exists when $X^\top X$ is invertible, but small eigenvalues of $X^\top X$ make $(X^\top X)^{-1}$ large in some directions. Ridge regression stabilizes the problem by penalizing large coefficient vectors.
Throughout this section, let $y \in \mathbb R^n$ be the response vector and let $X \in \mathbb R^{n \times p}$ be a fixed design matrix. Unless stated otherwise, assume the columns of $X$ have been centred and scaled, and assume the intercept has been treated separately and is not penalized.
[definition: Ridge Regression Estimator]
For $\rho > 0$, the ridge regression estimator is
\begin{align*}
\hat{\beta}^{\mathrm{ridge}}(\rho) := \operatorname*{argmin}_{\beta \in \mathbb R^p}\left\{|y-X\beta|^2 + \rho |\beta|^2\right\}.
\end{align*}
[/definition]
The penalty $\rho |\beta|^2$ makes large coefficient vectors costly. A larger value of $\rho$ gives more shrinkage, while $\rho$ close to $0$ makes the criterion close to the OLS residual sum of squares.
For computation and interpretation, the minimization problem should be converted into an explicit estimator. The key question is how the penalty modifies the normal equations and whether it removes the instability caused by small or zero eigenvalues of $X^\top X$.
[quotetheorem:4461]
[citeproof:4461]
This formula explains the stabilizing effect of ridge regression. Even if $X^\top X$ is singular, adding $\rho I_p$ moves every eigenvalue away from $0$.
[example: Ridge Under Multicollinearity]
Suppose two standardized predictors record nearly the same information, and write $|x_1|^2=|x_2|^2=1$ and $x_1^\top x_2=c$ with $c$ close to $1$. Rotate the coefficient vector by setting
\begin{align*}
z_+ &= \frac{x_1+x_2}{\sqrt 2},&
z_- &= \frac{x_1-x_2}{\sqrt 2},\\
a &= \frac{\beta_1+\beta_2}{\sqrt 2},&
b &= \frac{\beta_1-\beta_2}{\sqrt 2}.
\end{align*}
Then
\begin{align*}
x_1\beta_1+x_2\beta_2
&= \frac{x_1+x_2}{\sqrt 2}\frac{\beta_1+\beta_2}{\sqrt 2}
+ \frac{x_1-x_2}{\sqrt 2}\frac{\beta_1-\beta_2}{\sqrt 2}\\
&= z_+a+z_-b,
\end{align*}
and the penalty is unchanged by the rotation:
\begin{align*}
a^2+b^2
&= \frac{(\beta_1+\beta_2)^2+(\beta_1-\beta_2)^2}{2}\\
&= \frac{\beta_1^2+2\beta_1\beta_2+\beta_2^2+\beta_1^2-2\beta_1\beta_2+\beta_2^2}{2}\\
&= \beta_1^2+\beta_2^2.
\end{align*}
The rotated design directions satisfy
\begin{align*}
|z_+|^2
&= \frac{(x_1+x_2)^\top(x_1+x_2)}{2}
= \frac{1+2c+1}{2}
= 1+c,\\
|z_-|^2
&= \frac{(x_1-x_2)^\top(x_1-x_2)}{2}
= \frac{1-2c+1}{2}
= 1-c,\\
z_+^\top z_-
&= \frac{(x_1+x_2)^\top(x_1-x_2)}{2}
= \frac{1-c+c-1}{2}
= 0.
\end{align*}
Thus the ridge criterion in the two rotated coordinates is
\begin{align*}
|y-az_+-bz_-|^2+\rho(a^2+b^2)
&= y^\top y-2a z_+^\top y-2b z_-^\top y\\
&\quad +a^2|z_+|^2+b^2|z_-|^2+2abz_+^\top z_-+\rho a^2+\rho b^2\\
&= y^\top y-2a z_+^\top y-2b z_-^\top y\\
&\quad +(1+c+\rho)a^2+(1-c+\rho)b^2.
\end{align*}
Differentiating with respect to $a$ and $b$ gives
\begin{align*}
-2z_+^\top y+2(1+c+\rho)a&=0,&
-2z_-^\top y+2(1-c+\rho)b&=0,
\end{align*}
so
\begin{align*}
\hat a_{\mathrm{ridge}} &= \frac{z_+^\top y}{1+c+\rho},&
\hat b_{\mathrm{ridge}} &= \frac{z_-^\top y}{1-c+\rho}.
\end{align*}
Without the ridge penalty, the corresponding least-squares coefficients are
\begin{align*}
\hat a_{\mathrm{OLS}} &= \frac{z_+^\top y}{1+c},&
\hat b_{\mathrm{OLS}} &= \frac{z_-^\top y}{1-c}.
\end{align*}
Therefore
\begin{align*}
\hat a_{\mathrm{ridge}}
&= \frac{1+c}{1+c+\rho}\hat a_{\mathrm{OLS}},&
\hat b_{\mathrm{ridge}}
&= \frac{1-c}{1-c+\rho}\hat b_{\mathrm{OLS}}.
\end{align*}
When $c$ is close to $1$, the contrast direction $z_-=(x_1-x_2)/\sqrt 2$ has very small squared length $1-c$, so its shrinkage factor $(1-c)/(1-c+\rho)$ is much smaller than the common-information shrinkage factor $(1+c)/(1+c+\rho)$. Since $b=(\beta_1-\beta_2)/\sqrt 2$ measures the difference between the two coefficients, ridge strongly suppresses the large opposite-signed coefficient pattern that multicollinearity makes unstable.
[/example]
The example shows one unstable contrast direction by hand, but real designs can have many such directions. To describe all of them at once, we need a coordinate system adapted to the design matrix itself. The singular value decomposition supplies that coordinate system and reveals the exact shrinkage factor applied along each information direction.
[quotetheorem:4462]
[citeproof:4462]
The factor $d_j^2/(d_j^2+\rho)$ lies between $0$ and $1$. Directions with large $d_j$ are close to OLS, while directions with small $d_j$ are damped heavily.
[remark: Penalized and Constrained Forms]
Ridge regression can also be written as minimizing $|y-X\beta|^2$ subject to $|\beta|^2 \le t$ for a suitable $t$. The penalized parameter $\rho$ and the constraint radius $t$ encode the same tradeoff from two perspectives, although the exact relationship depends on the data.
[/remark]
## Lasso Regression and Sparsity
What if the goal is not only stable prediction, but also a model involving a small subset of the available variables? In high-dimensional applications, many columns of $X$ may be weakly relevant or irrelevant. The lasso replaces the squared Euclidean penalty by an absolute-value penalty, and this change produces exact zeros in the fitted coefficient vector.
[definition: Lasso Estimator]
For $\lambda > 0$, the lasso estimator is any solution of
\begin{align*}
\hat{\beta}^{\mathrm{lasso}}(\lambda) \in \operatorname*{argmin}_{\beta \in \mathbb R^p}\left\{\frac{1}{2}|y-X\beta|^2 + \lambda \sum_{j=1}^p |\beta_j|\right\}.
\end{align*}
[/definition]
The lasso objective is convex but not differentiable at points where some $\beta_j=0$. Ordinary derivatives are therefore unavailable at precisely the values that make the estimator sparse. To write first-order conditions that still include zero coefficients, we need the subgradient of the absolute-value function.
[definition: Subgradient of the Absolute Value]
The subgradient of $t \mapsto |t|$ at $a \in \mathbb R$ is the set
\begin{align*}
\partial |a| :=
\begin{cases}
\{1\}, & a>0,\\
[-1,1], & a=0,\\
\{-1\}, & a<0.
\end{cases}
\end{align*}
[/definition]
The subgradient interval at zero gives a way for a variable to remain excluded without violating optimality. The next step is to combine this one-dimensional subgradient rule with the residual correlations from the squared-error term, producing a coordinatewise criterion that distinguishes active variables from variables held exactly at zero.
[quotetheorem:4463]
[citeproof:4463]
The KKT conditions give an operational interpretation. A variable enters the fitted model only when its correlation with the current residual reaches the threshold $\lambda$ in magnitude.
[example: Sparse Feature Selection in Text Counts]
Let $x_j$ be the centered and scaled count vector for word $j$, let $\hat{\beta}$ be a lasso solution, and let
\begin{align*}
r = y-X\hat{\beta}
\end{align*}
be the fitted residual vector. By the *Karush-Kuhn-Tucker Conditions for Lasso*, there is a vector $z$ such that
\begin{align*}
X^\top r &= \lambda z,\\
z_j &\in \partial |\hat{\beta}_j|.
\end{align*}
Taking the $j$th coordinate of $X^\top r=\lambda z$ gives
\begin{align*}
x_j^\top r=\lambda z_j.
\end{align*}
If the word $j$ is excluded from the lasso model, then $\hat{\beta}_j=0$, so the subgradient rule gives $z_j\in[-1,1]$. Therefore
\begin{align*}
|x_j^\top r|
&=|\lambda z_j|\\
&=\lambda |z_j|\\
&\le \lambda.
\end{align*}
Thus a word with zero coefficient has residual correlation no larger than $\lambda$ in magnitude. If the word $j$ is included, then $\hat{\beta}_j\ne 0$, so $z_j=\operatorname{sgn}(\hat{\beta}_j)$ and
\begin{align*}
x_j^\top r
&=\lambda z_j\\
&=\lambda \operatorname{sgn}(\hat{\beta}_j).
\end{align*}
For example, a lasso fit may keep words such as "excellent", "refund", "slow", and "recommend" precisely because their residual correlations reach the signed threshold, while many other word-count columns remain at zero because their residual correlations stay inside the interval $[-\lambda,\lambda]$.
[/example]
Coordinate descent is a common algorithmic view of the lasso. Instead of solving all coordinates at once, it repeatedly updates one coefficient while holding the rest fixed.
[explanation: Coordinate Descent Intuition]
Fix all coefficients except $\beta_j$, and write the partial residual as
\begin{align*}
r_j = y - \sum_{k \ne j} x_k\beta_k.
\end{align*}
The one-dimensional update minimizes
\begin{align*}
\frac{1}{2}|r_j-x_j\beta_j|^2 + \lambda |\beta_j|.
\end{align*}
If the columns are standardized so that $|x_j|^2=1$, the update is the soft-thresholding rule
\begin{align*}
\beta_j \leftarrow S_\lambda(x_j^\top r_j), \qquad
S_\lambda(t)=\operatorname{sgn}(t)(|t|-\lambda)_+.
\end{align*}
Thus small partial correlations are sent to zero, while larger partial correlations are reduced toward zero by $\lambda$. Repeating this update over coordinates gradually finds a solution of the convex optimization problem.
[/explanation]
## Bias-Variance Tradeoff and Tuning by Cross-Validation
How should the tuning parameter be chosen? A very small penalty gives a flexible model with low training error but potentially high variance. A very large penalty gives a stable model that may be biased because important signal is shrunk too much.
[definition: Prediction Risk]
For an estimator $\hat{f}$ of the regression function and a fresh observation $(X_{\mathrm{new}},Y_{\mathrm{new}})$ from the same population, the squared prediction risk is
\begin{align*}
R(\hat{f}) := \mathbb E[(Y_{\mathrm{new}}-\hat{f}(X_{\mathrm{new}}))^2].
\end{align*}
[/definition]
The risk is not the same as training error. Training error usually decreases as the model becomes more flexible, while prediction risk may increase after the estimator begins fitting noise.
To understand why a penalty can improve prediction even while increasing bias, we need to separate the two ways an estimator can miss a future response. The relevant decomposition identifies the squared bias, the sampling variance of the fitted rule, and the irreducible noise in the response.
[quotetheorem:4464]
[citeproof:4464]
Ridge and lasso deliberately introduce bias. Their usefulness comes from cases where the corresponding variance reduction is larger than the added squared bias.
This perspective is why tuning cannot be judged from the training residuals alone. A penalty that looks worse in sample may give better predictions because it stabilizes the fitted rule on future data. The practical task is therefore to estimate prediction risk for several candidate penalties using data not used to fit that candidate model. In penalized regression, this requires a validation procedure that repeats the entire fitting step separately for each candidate penalty and each held-out fold.
[definition: K-Fold Cross-Validation]
Given candidate tuning parameters $\Lambda$ and an integer $K\ge 2$, $K$-fold cross-validation partitions the data into folds $I_1,\dots,I_K$. For each $\lambda \in \Lambda$, it fits the model on the data outside $I_k$ and computes prediction error on $I_k$, then averages over $k=1,\dots,K$.
[/definition]
After the definition, the usual selected tuning parameter is the value of $\lambda$ with the smallest average validation error. Some practitioners use the one-standard-error rule: choose the largest penalty whose validation error is within one estimated standard error of the minimum, favouring a simpler model.
[example: Predicting Test Score from Many Demographic Variables]
Suppose $Y$ is a standardized test score and the columns of $X$ include school indicators, neighbourhood summaries, parental education variables, and interactions among them. If a small demographic group has only a few observations, an unpenalized least-squares fit can assign a large coefficient to that group whenever doing so reduces its training residuals, even if the apparent difference is mostly noise.
For ridge regression, a candidate penalty $\rho$ is evaluated by fitting
\begin{align*}
\hat{\beta}^{(-k)}_{\rho}
=
\operatorname*{argmin}_{\beta}
\left\{
\sum_{i\notin I_k}(Y_i-X_i^\top\beta)^2+\rho|\beta|^2
\right\}
\end{align*}
on the data outside fold $I_k$, and then computing the validation error on the held-out fold:
\begin{align*}
\operatorname{CV}(\rho)
=
\frac{1}{K}\sum_{k=1}^K
\frac{1}{|I_k|}
\sum_{i\in I_k}
\left(Y_i-X_i^\top\hat{\beta}^{(-k)}_{\rho}\right)^2.
\end{align*}
When $\rho=0$, the fit behaves like ordinary least squares and can give large weights to unstable directions involving small groups or highly correlated demographic variables. For $\rho>0$, the extra term $\rho|\beta|^2$ makes a coefficient vector with many large entries costly, so related school, neighbourhood, and interaction coefficients are pulled toward zero together.
For lasso, the corresponding fold-specific fit is
\begin{align*}
\hat{\beta}^{(-k)}_{\lambda}
\in
\operatorname*{argmin}_{\beta}
\left\{
\frac{1}{2}\sum_{i\notin I_k}(Y_i-X_i^\top\beta)^2
+\lambda\sum_{j=1}^p|\beta_j|
\right\}.
\end{align*}
The validation error is computed by the same held-out average,
\begin{align*}
\operatorname{CV}(\lambda)
=
\frac{1}{K}\sum_{k=1}^K
\frac{1}{|I_k|}
\sum_{i\in I_k}
\left(Y_i-X_i^\top\hat{\beta}^{(-k)}_{\lambda}\right)^2.
\end{align*}
The absolute-value penalty can set some coefficients exactly to zero, so a lasso fit may keep only a smaller subset of school indicators, parental education variables, and interactions.
As the penalty increases from zero, validation error can decrease because unstable coefficients are shrunk or removed. If the penalty becomes too large, important signal is also shrunk toward zero, so the held-out prediction errors
\begin{align*}
\left(Y_i-X_i^\top\hat{\beta}^{(-k)}_{\rho}\right)^2
\quad \text{or} \quad
\left(Y_i-X_i^\top\hat{\beta}^{(-k)}_{\lambda}\right)^2
\end{align*}
begin to increase. The cross-validation curve therefore records the bias-variance tradeoff in prediction rather than the training error alone.
[/example]
## Post-Selection Caution in High-Dimensional Regression
What changes after a model has been selected using the same data used to estimate it? Standard confidence intervals and $p$-values from an OLS refit treat the selected model as if it had been fixed before seeing the data. That ignores the randomness introduced by the selection step.
[remark: Post-Selection Inference]
After lasso selects a subset of variables, refitting OLS on the selected variables can reduce shrinkage bias for prediction. However, ordinary OLS standard errors from the refit do not account for the fact that the selected subset was data-dependent. Valid inferential statements require either sample splitting, selective inference methods, or a pre-specified model chosen independently of the outcome data.
[/remark]
The distinction between prediction and inference is essential in high-dimensional regression. For prediction, cross-validated regularization can be effective even when the selected variables are not stable. For scientific interpretation, instability of the selected set is itself evidence that coefficient-level conclusions should be treated cautiously.
[example: Selection Instability]
Consider the idealized case where the two word-count columns are identical after centering and scaling: $x_1=x_2=x$ and $|x|^2=1$. Holding all other fitted coefficients fixed, let $r$ be the partial residual left for these two words, and write $t=x^\top r$. The two-coordinate lasso subproblem is
\begin{align*}
\min_{\beta_1,\beta_2}
\left\{
\frac{1}{2}|r-x\beta_1-x\beta_2|^2+\lambda(|\beta_1|+|\beta_2|)
\right\}.
\end{align*}
If $t>\lambda$ and we restrict to nonnegative coefficients, only the sum $s=\beta_1+\beta_2$ affects the fitted value:
\begin{align*}
x\beta_1+x\beta_2=x(\beta_1+\beta_2)=xs.
\end{align*}
For any split with $\beta_1,\beta_2\ge 0$ and $\beta_1+\beta_2=s$, the penalty is
\begin{align*}
|\beta_1|+|\beta_2|=\beta_1+\beta_2=s,
\end{align*}
so the objective becomes
\begin{align*}
\frac{1}{2}|r-xs|^2+\lambda s
&=\frac{1}{2}(r-xs)^\top(r-xs)+\lambda s\\
&=\frac{1}{2}(r^\top r-2s x^\top r+s^2x^\top x)+\lambda s\\
&=\frac{1}{2}r^\top r-st+\frac{1}{2}s^2+\lambda s.
\end{align*}
Differentiating with respect to $s$ gives
\begin{align*}
\frac{d}{ds}\left(\frac{1}{2}r^\top r-st+\frac{1}{2}s^2+\lambda s\right)
&=-t+s+\lambda,
\end{align*}
so the minimizer satisfies
\begin{align*}
-t+s+\lambda=0,
\qquad
\hat{s}=t-\lambda.
\end{align*}
Thus both coefficient vectors
\begin{align*}
(\hat{\beta}_1,\hat{\beta}_2)=(t-\lambda,0)
\quad \text{and} \quad
(\hat{\beta}_1,\hat{\beta}_2)=(0,t-\lambda)
\end{align*}
give the same fitted contribution,
\begin{align*}
x\hat{\beta}_1+x\hat{\beta}_2=x(t-\lambda),
\end{align*}
and the same penalty,
\begin{align*}
\lambda(|\hat{\beta}_1|+|\hat{\beta}_2|)
=\lambda(t-\lambda).
\end{align*}
In real text data the two synonym columns are usually not exactly equal, but bootstrap resampling can make one column slightly more aligned with the residual in one sample and the other slightly more aligned in another. The predictions can therefore remain nearly unchanged while the selected word list changes, so sparsity should not be read as a definitive causal or explanatory ranking of the words.
[/example]
Regularization extends linear regression from a fixed low-dimensional inferential tool to a flexible prediction method for modern data. Ridge emphasizes stability through continuous shrinkage, lasso emphasizes sparsity through thresholding, and cross-validation supplies a practical way to tune the strength of regularization. The price is that the fitted model is no longer an unpenalized projection with the same inferential formulas as OLS, so interpretation must track the estimation and selection procedure that produced it.
Regularization changes the estimation problem by trading off bias and variance, especially in settings where OLS is ill-posed or too noisy. The final step is to move beyond Gaussian responses altogether and keep the linear predictor while changing the mean-variance relationship and likelihood structure.
# 10. Generalized Linear Models and Logistic Regression
This chapter moves from linear regression with Gaussian errors to models where the response has a non-Gaussian distribution. The organising question is how to keep the linear predictor $x_i^\top\beta$ while changing the probability law and the scale on which the mean is modelled. Generalized linear models provide this framework, and logistic regression is the central example for binary responses.
The linear model treated the conditional mean as an unconstrained real number and used Gaussian errors to make least squares coincide with maximum likelihood. This chapter keeps the linear predictor but changes both the distributional model and the scale on which the mean is linear. The main examples are exponential-family generalized linear models, binary logistic regression, likelihood-based fitting, and large-sample inference through information and deviance.
## Modelling Means Through Exponential Families
When responses are counts, proportions, or binary outcomes, the constant-variance normal model no longer matches the shape of the data. The first problem is to find a family of distributions rich enough for inference but structured enough that likelihood equations resemble the normal equations from least squares.
[definition: One-Parameter Exponential Family]
A family of distributions for a response $Y$ is a one-parameter exponential family if its density or probability mass function can be written as
\begin{align*}
f(y;\theta,\phi)=\exp\left(\frac{y\theta-b(\theta)}{a(\phi)}+c(y,\phi)\right),
\end{align*}
where $\theta$ is the canonical parameter, $\phi$ is a dispersion parameter, and $a$, $b$, $c$ are known functions.
[/definition]
The function $b$ controls the mean and variance. Differentiating the normalising identity for the density gives $\mathbb E[Y]=b^{\prime}(\theta)$ and $\operatorname{Var}(Y)=a(\phi)b^{\prime\prime}(\theta)$ whenever the derivatives exist.
[example: Bernoulli Exponential Family]
For $Y\sim \operatorname{Ber}(p)$ with $0<p<1$, the probability mass function is
\begin{align*}
\mathbb P(Y=y)
&=p^y(1-p)^{1-y},\qquad y\in\{0,1\}\\
&=\exp\{y\log p+(1-y)\log(1-p)\}\\
&=\exp\{y\log p+\log(1-p)-y\log(1-p)\}\\
&=\exp\left\{y\log\frac{p}{1-p}+\log(1-p)\right\}.
\end{align*}
Set
\begin{align*}
\theta=\log\frac{p}{1-p}.
\end{align*}
Then $e^\theta=p/(1-p)$, so
\begin{align*}
e^\theta(1-p)&=p,\\
e^\theta&=p(1+e^\theta),\\
p&=\frac{e^\theta}{1+e^\theta}.
\end{align*}
Therefore
\begin{align*}
1-p
&=1-\frac{e^\theta}{1+e^\theta}\\
&=\frac{1+e^\theta-e^\theta}{1+e^\theta}\\
&=\frac{1}{1+e^\theta},
\end{align*}
and hence
\begin{align*}
\log(1-p)
=\log\left(\frac{1}{1+e^\theta}\right)
=-\log(1+e^\theta).
\end{align*}
Thus the mass function becomes
\begin{align*}
\mathbb P(Y=y)
&=\exp\{y\theta-\log(1+e^\theta)\},
\end{align*}
which is the one-parameter exponential-family form with dispersion $a(\phi)=1$ and
\begin{align*}
b(\theta)=\log(1+e^\theta).
\end{align*}
Differentiating this function gives
\begin{align*}
b'(\theta)
&=\frac{1}{1+e^\theta}\frac{d}{d\theta}(1+e^\theta)\\
&=\frac{e^\theta}{1+e^\theta}\\
&=p.
\end{align*}
So the canonical parameter is the log-odds $\theta=\log(p/(1-p))$, and the mean $p$ is recovered from it by the inverse-logit formula. This is why log-odds arise from the Bernoulli likelihood itself rather than from an arbitrary choice of scale.
[/example]
The Bernoulli calculation is the prototype: the distribution determines the natural mean range, and the link records which transformation of that mean is modelled linearly.
This suggests a common framework for regression problems where the response is not naturally Gaussian or real-valued. Instead of forcing the mean itself to be linear, we specify a distributional family, a mean constrained to its allowable range, and a link that places that mean on a linear-predictor scale.
The framework needs a precise list of ingredients so that different response types can be compared without changing the basic regression architecture. The definition below records those ingredients: distribution, mean, linear predictor, and link.
[definition: Generalized Linear Model]
A generalized linear model for independent responses $Y_1,\dots,Y_n$ consists of:
1. a distribution for each $Y_i$ in a common exponential family;
2. a mean $\mu_i=\mathbb E[Y_i]$ belonging to an allowable mean space $\mathcal M\subseteq\mathbb R$;
3. a linear predictor $\eta_i=x_i^\top\beta$, where $x_i\in\mathbb R^p$ and $\beta\in\mathcal B\subseteq\mathbb R^p$;
4. a link function $g:\mathcal M\to\mathbb R$ satisfying
\begin{align*}
g(\mu_i)=\eta_i.
\end{align*}
[/definition]
The link separates the scale on which effects add from the scale on which the response is measured. In ordinary linear regression the identity link gives $\mu_i=x_i^\top\beta$; in logistic regression the mean lies in $(0,1)$, so the identity link is not suitable for unconstrained linear predictors.
Among the possible links, exponential-family models single out one especially natural choice. Since the likelihood is written in terms of the canonical parameter $\theta$, it is useful to know which transformation of the mean makes the linear predictor equal to that parameter.
To use this idea systematically, we need a definition that converts the mean parameter back to the canonical parameter whenever that conversion is possible. That definition identifies the link most directly tied to the likelihood algebra.
[definition: Canonical Link]
Let $\Theta\subseteq\mathbb R$ be the canonical-parameter space of an exponential family and let $\mathcal M=b^{\prime}(\Theta)$ be its mean space. The canonical link is the function $g:\mathcal M\to\Theta$ satisfying
\begin{align*}
g(\mu)=\theta \quad \text{whenever } \mu=b^{\prime}(\theta).
\end{align*}
[/definition]
Canonical links are important because they make likelihood derivatives take a particularly simple form. They also explain why logistic regression uses the logit link and Poisson regression uses the logarithmic link.
For estimation, the central object is the score equation obtained by differentiating the log-likelihood with respect to the coefficients. Without a simplification, the link function, variance function, and design matrix can obscure what condition the fitted coefficients actually satisfy. Under a canonical link, that obstruction disappears: the likelihood equations become a weighted residual orthogonality condition, making the connection with least-squares normal equations explicit.
[quotetheorem:4465]
[citeproof:4465]
The second display is the direct analogue of the OLS normal equations: the fitted residuals are orthogonal to every column of the design matrix. Independence is what lets the log-likelihood split into a sum of observationwise contributions; with clustered or repeated observations, the same equation may underestimate uncertainty unless the dependence is modelled or corrected. Positive variance is needed because the score formula divides by $\operatorname{Var}(Y_i)$, and a degenerate response contributes no Fisher information about the slope. Differentiability rules out links with corners or parameter values at the boundary, where the chain-rule calculation can fail. The canonical-link simplification does not say that the estimator has a closed form or that the likelihood is well behaved; it only identifies the algebraic form of the likelihood equations.
## Logistic Regression for Binary Responses
Binary regression asks for a model of $\mathbb P(Y_i=1\mid x_i)$, not for a real-valued conditional mean with additive Gaussian noise. The constraint $0<p_i<1$ forces us to choose a link that maps probabilities to the whole real line.
[definition: Logit Link]
The logit link is the function $g:(0,1)\to\mathbb R$ defined by
\begin{align*}
g(p)=\log\frac{p}{1-p}.
\end{align*}
[/definition]
The inverse logit turns a linear predictor into a probability:
\begin{align*}
p_i=\frac{e^{x_i^\top\beta}}{1+e^{x_i^\top\beta}}=\frac{1}{1+e^{-x_i^\top\beta}}.
\end{align*}
Thus logistic regression is linear on the log-odds scale and nonlinear on the probability scale.
With the link fixed, the remaining task is to state the full probabilistic model for binary data. The definition below combines the Bernoulli response assumption with the logit-linear relationship that determines the success probabilities.
[definition: Logistic Regression Model]
For independent binary responses $Y_i\in\{0,1\}$ with covariates $x_i\in\mathbb R^p$, the logistic regression model is
\begin{align*}
Y_i\sim \operatorname{Ber}(p_i),\qquad \log\frac{p_i}{1-p_i}=x_i^\top\beta.
\end{align*}
[/definition]
A coefficient is interpreted through odds rather than probability. If $x_{ij}$ increases by one while the other covariates are held fixed, then the odds $p_i/(1-p_i)$ are multiplied by $e^{\beta_j}$.
[example: Disease Presence from Age and Exposure]
Let $Y_i$ indicate whether patient $i$ has a disease, let $a_i$ be age in decades, and let $e_i\in\{0,1\}$ indicate exposure to a risk factor. In the model
\begin{align*}
\log\frac{p_i}{1-p_i}
=\beta_0+\beta_1a_i+\beta_2e_i,
\end{align*}
the odds for patient $i$ are obtained by exponentiating both sides:
\begin{align*}
\frac{p_i}{1-p_i}
&=\exp(\beta_0+\beta_1a_i+\beta_2e_i)\\
&=\exp(\beta_0)\exp(\beta_1a_i)\exp(\beta_2e_i).
\end{align*}
To isolate the age effect, hold exposure fixed at $e_i=e$ and compare ages $a$ and $a+1$. The corresponding fitted odds are
\begin{align*}
\operatorname{odds}(a,e)
&=\exp(\beta_0+\beta_1a+\beta_2e),\\
\operatorname{odds}(a+1,e)
&=\exp(\beta_0+\beta_1(a+1)+\beta_2e).
\end{align*}
Therefore
\begin{align*}
\frac{\operatorname{odds}(a+1,e)}{\operatorname{odds}(a,e)}
&=\frac{\exp(\beta_0+\beta_1(a+1)+\beta_2e)}
{\exp(\beta_0+\beta_1a+\beta_2e)}\\
&=\exp\{(\beta_0+\beta_1a+\beta_1+\beta_2e)-(\beta_0+\beta_1a+\beta_2e)\}\\
&=\exp(\beta_1).
\end{align*}
Thus each additional decade multiplies the fitted disease odds by $e^{\beta_1}$ when exposure is held fixed.
To isolate the exposure effect, hold age fixed at $a$ and compare $e=0$ with $e=1$:
\begin{align*}
\operatorname{odds}(a,0)
&=\exp(\beta_0+\beta_1a),\\
\operatorname{odds}(a,1)
&=\exp(\beta_0+\beta_1a+\beta_2).
\end{align*}
Hence
\begin{align*}
\frac{\operatorname{odds}(a,1)}{\operatorname{odds}(a,0)}
&=\frac{\exp(\beta_0+\beta_1a+\beta_2)}
{\exp(\beta_0+\beta_1a)}\\
&=\exp\{(\beta_0+\beta_1a+\beta_2)-(\beta_0+\beta_1a)\}\\
&=\exp(\beta_2).
\end{align*}
For instance, if $\hat\beta_2=0.7$, then
\begin{align*}
e^{\hat\beta_2}
=e^{0.7}
\approx 2.01,
\end{align*}
so exposed patients have about twice the fitted odds of disease compared with unexposed patients of the same age under this model.
[/example]
This example also shows why odds ratios must be interpreted at fixed covariate values: changing exposure while also changing age would mix the two multiplicative effects and no longer isolate $e^{\beta_2}$.
After interpretation comes estimation: the logistic coefficients must be chosen from the data. Unlike least squares, the nonlinear mean function prevents a closed-form normal equation for $\hat\beta$.
The key estimation question is what condition a maximum-likelihood fit must satisfy when the response is Bernoulli. The result below gives the logistic score equation, comparing observed successes with fitted probabilities along every covariate direction.
[quotetheorem:4466]
[citeproof:4466]
The score equation $\sum_i x_i(Y_i-p_i)=0$ says that the fitted model balances observed and fitted successes along each covariate direction. This interpretation relies on the Bernoulli likelihood and conditional independence; overdispersion, clustering, or misspecified binary outcomes change the information calculation even when the fitted mean equation is retained. The equation is necessary for an interior maximum, but it is not sufficient for existence or uniqueness. Under complete or quasi-complete separation, the likelihood can increase along an unbounded direction, so the score equation alone does not ensure a finite maximiser. There is no closed-form solution in general, so numerical likelihood maximisation and convergence checks are part of the statistical analysis rather than an implementation detail.
Once probabilities have been estimated, many applications require a binary decision rather than only a probability report. To turn fitted probabilities into predicted classes, one must choose a threshold and state the induced decision rule explicitly.
[definition: Classification Rule from Logistic Regression]
Given a threshold $t\in(0,1)$, the logistic-regression classification rule is the map $C_t:[0,1]\to\{0,1\}$ defined on a fitted probability $\hat p_i$ by
\begin{align*}
C_t(\hat p_i)=\hat Y_i(t)=\mathbb{1}_{\{\hat p_i\ge t\}}.
\end{align*}
[/definition]
The common threshold $t=1/2$ is appropriate only when false positives and false negatives have equal costs and the class proportions used for fitting match the prediction setting. Changing $t$ changes classification performance without changing the fitted probability model.
[example: Confusion Matrix and ROC Curve]
For a fitted disease model with fitted probabilities $\hat p_i$, fix a threshold $t\in(0,1)$ and classify observation $i$ by
\begin{align*}
\hat Y_i(t)=\mathbb 1_{\{\hat p_i\ge t\}}.
\end{align*}
Comparing $\hat Y_i(t)$ with the observed binary outcome $Y_i\in\{0,1\}$ gives the four confusion-matrix counts
\begin{align*}
\operatorname{TP}(t)
&=\sum_{i=1}^n \mathbb 1_{\{Y_i=1\}}\mathbb 1_{\{\hat Y_i(t)=1\}},\\
\operatorname{FP}(t)
&=\sum_{i=1}^n \mathbb 1_{\{Y_i=0\}}\mathbb 1_{\{\hat Y_i(t)=1\}},\\
\operatorname{TN}(t)
&=\sum_{i=1}^n \mathbb 1_{\{Y_i=0\}}\mathbb 1_{\{\hat Y_i(t)=0\}},\\
\operatorname{FN}(t)
&=\sum_{i=1}^n \mathbb 1_{\{Y_i=1\}}\mathbb 1_{\{\hat Y_i(t)=0\}}.
\end{align*}
For the truly diseased observations, each case has either $\hat Y_i(t)=1$ or $\hat Y_i(t)=0$, so
\begin{align*}
\operatorname{TP}(t)+\operatorname{FN}(t)
&=\sum_{i=1}^n \mathbb 1_{\{Y_i=1\}}\mathbb 1_{\{\hat Y_i(t)=1\}}
+\sum_{i=1}^n \mathbb 1_{\{Y_i=1\}}\mathbb 1_{\{\hat Y_i(t)=0\}}\\
&=\sum_{i=1}^n \mathbb 1_{\{Y_i=1\}}
\left(\mathbb 1_{\{\hat Y_i(t)=1\}}+\mathbb 1_{\{\hat Y_i(t)=0\}}\right)\\
&=\sum_{i=1}^n \mathbb 1_{\{Y_i=1\}}.
\end{align*}
Thus sensitivity is the fraction of diseased observations correctly classified as positive:
\begin{align*}
\operatorname{Sensitivity}(t)
=\frac{\operatorname{TP}(t)}{\operatorname{TP}(t)+\operatorname{FN}(t)}.
\end{align*}
Similarly, among the truly non-diseased observations,
\begin{align*}
\operatorname{TN}(t)+\operatorname{FP}(t)
&=\sum_{i=1}^n \mathbb 1_{\{Y_i=0\}},
\end{align*}
so specificity and false positive rate are
\begin{align*}
\operatorname{Specificity}(t)
&=\frac{\operatorname{TN}(t)}{\operatorname{TN}(t)+\operatorname{FP}(t)},\\
\operatorname{FPR}(t)
&=\frac{\operatorname{FP}(t)}{\operatorname{TN}(t)+\operatorname{FP}(t)}
=1-\operatorname{Specificity}(t).
\end{align*}
As $t$ decreases from $1$ to $0$, more observations satisfy $\hat p_i\ge t$, so both $\operatorname{TP}(t)$ and $\operatorname{FP}(t)$ can only increase. The ROC curve plots the points
\begin{align*}
\left(\operatorname{FPR}(t),\operatorname{Sensitivity}(t)\right),
\qquad 0<t<1.
\end{align*}
Its area summarizes how well the fitted probabilities rank diseased observations above non-diseased observations; it does not check whether a fitted probability such as $0.8$ corresponds to an empirical disease frequency near $80\%$.
[/example]
## Iteratively Reweighted Least Squares
The likelihood equations for logistic regression resemble least squares equations, but the fitted probabilities depend nonlinearly on $\beta$. The computational question is how to exploit the local quadratic shape of the likelihood while respecting the mean-variance relationship of the GLM.
[definition: Working Weights and Working Response]
For a GLM with current parameter value $\beta^{(k)}$, fitted means $\mu_i^{(k)}$, and linear predictors $\eta_i^{(k)}$, the working weights and working response are
\begin{align*}
w_i^{(k)}&=\frac{1}{\operatorname{Var}(Y_i)}\left(\frac{\partial\mu_i}{\partial\eta_i}\right)^2,\\
z_i^{(k)}&=\eta_i^{(k)}+(Y_i-\mu_i^{(k)})\frac{\partial\eta_i}{\partial\mu_i}\bigg|_{\mu_i=\mu_i^{(k)}}.
\end{align*}
[/definition]
These quantities come from a first-order expansion of the mean function and a second-order approximation to the log-likelihood. The next iterate is obtained by weighted least squares regression of $z^{(k)}$ on the columns of $X$ using weights $w_i^{(k)}$.
[quotetheorem:4467]
[citeproof:4467]
For logistic regression, $\operatorname{Var}(Y_i)=p_i(1-p_i)$ and $d\mu_i/d\eta_i=p_i(1-p_i)$, so $w_i=p_i(1-p_i)$. The weights are largest near $p_i=1/2$ and small near probabilities close to $0$ or $1$, reflecting where a Bernoulli observation carries the most information about the slope. The invertibility of $X^\top W^{(k)}X$ is essential because the weighted least squares problem must have a unique local quadratic maximiser; exact collinearity in the design matrix makes the update non-identifiable. Near separation, some fitted probabilities approach $0$ or $1$, the corresponding weights approach $0$, and the algorithm may produce increasingly large coefficients while the likelihood still has no finite maximum. An IRLS step is therefore a local scoring step, not a guarantee of global convergence, so fitted software should be checked for warnings, monotone likelihood increase, coefficient divergence, and singular information matrices.
[example: Complete Separation and Penalized Logistic Regression]
Consider the one-slope logistic model
\begin{align*}
p_i(\beta)=\frac{e^{\beta x_i}}{1+e^{\beta x_i}},
\qquad
\log\frac{p_i(\beta)}{1-p_i(\beta)}=\beta x_i.
\end{align*}
If $Y_i=1$, the contribution to the log-likelihood is
\begin{align*}
\log p_i(\beta)
&=\log\left(\frac{e^{\beta x_i}}{1+e^{\beta x_i}}\right)\\
&=\beta x_i-\log(1+e^{\beta x_i})\\
&=-\log(1+e^{-\beta x_i}).
\end{align*}
If $Y_i=0$, the contribution is
\begin{align*}
\log(1-p_i(\beta))
&=\log\left(1-\frac{e^{\beta x_i}}{1+e^{\beta x_i}}\right)\\
&=\log\left(\frac{1+e^{\beta x_i}-e^{\beta x_i}}{1+e^{\beta x_i}}\right)\\
&=-\log(1+e^{\beta x_i}).
\end{align*}
Now assume $Y_i=0$ whenever $x_i<0$ and $Y_i=1$ whenever $x_i>0$. For $x_i>0$ and $Y_i=1$,
\begin{align*}
-\beta x_i \to -\infty
\quad\text{as}\quad \beta\to\infty,
\end{align*}
so $e^{-\beta x_i}\to 0$ and therefore
\begin{align*}
\log p_i(\beta)=-\log(1+e^{-\beta x_i})\to -\log(1)=0.
\end{align*}
For $x_i<0$ and $Y_i=0$,
\begin{align*}
\beta x_i \to -\infty
\quad\text{as}\quad \beta\to\infty,
\end{align*}
so $e^{\beta x_i}\to 0$ and therefore
\begin{align*}
\log(1-p_i(\beta))=-\log(1+e^{\beta x_i})\to -\log(1)=0.
\end{align*}
Every Bernoulli log-likelihood contribution is at most $0$, because probabilities lie in $(0,1)$ and hence their logarithms are negative. Thus the log-likelihood has supremum $0$, and the preceding limits show that $\ell(\beta)\to 0$ as $\beta\to\infty$. No finite $\beta$ attains this value, since each fitted probability is strictly between $0$ and $1$, so each nonempty contribution is strictly negative.
If instead we maximize the penalized criterion
\begin{align*}
Q_\lambda(\beta)=\ell(\beta)-\lambda\beta^2,
\qquad \lambda>0,
\end{align*}
then
\begin{align*}
Q_\lambda(\beta)\le 0-\lambda\beta^2=-\lambda\beta^2\to -\infty
\end{align*}
as $|\beta|\to\infty$. Since $Q_\lambda$ is continuous in $\beta$, its maximum is attained at some finite value. The penalty therefore replaces the separated likelihood's unbounded coefficient sequence with a finite estimate, encoding the modelling choice that perfect separation should not justify an infinite slope.
[/example]
## Large-Sample Inference and Deviance
Once maximum likelihood estimates have been computed, the next question is how to quantify uncertainty and compare nested GLMs. The answer parallels least squares inference, but the residual sum of squares is replaced by likelihood curvature and deviance.
[definition: Fisher Information for a GLM]
For a GLM with parameter domain $\mathcal B\subseteq\mathbb R^p$, the Fisher information is the map $I:\mathcal B\to\mathbb R^{p\times p}$ defined by
\begin{align*}
I(\beta)=\mathbb E_\beta[U(\beta)U(\beta)^\top]= -\mathbb E_\beta[H(\beta)],
\end{align*}
where $U(\beta)$ is the score and $H(\beta)$ is the Hessian of the log-likelihood.
[/definition]
Under standard regularity conditions, maximum likelihood estimators are approximately normal:
\begin{align*}
\hat\beta\approx \mathcal N(\beta, I(\beta)^{-1}).
\end{align*}
In applications $I(\beta)$ is replaced by the observed or fitted information at $\hat\beta$.
The practical problem is to turn this local quadratic approximation to the likelihood into tests of scientific restrictions, such as whether a group of covariates has no effect. Different tests use different parts of the likelihood geometry: the fitted estimate and its standard error, the drop in maximised likelihood, or the slope of the likelihood at the null model.
[quotetheorem:4468]
[citeproof:4468]
These three tests agree to first order in large samples but can differ in finite samples. The regularity conditions exclude common GLM failures: a boundary parameter gives a nonstandard limiting law, separation in logistic regression prevents an ordinary finite maximum, and non-identifiability or singular information destroys the quadratic approximation. Small samples can also make the normal approximation poor even when the model is identifiable. Wald tests are convenient after fitting the full model, but they are especially sensitive to unstable standard errors; likelihood-ratio tests are often more stable for comparing nested models; score tests require only the null model fit.
[definition: Deviance]
For a fitted GLM, the deviance is
\begin{align*}
D=2\{\ell_{\mathrm{sat}}-\ell(\hat\beta)\},
\end{align*}
where $\ell_{\mathrm{sat}}$ is the maximised log-likelihood of the saturated model and $\ell(\hat\beta)$ is the maximised log-likelihood of the fitted model.
[/definition]
The saturated model fits one mean parameter per observation, so the deviance measures lack of fit relative to the best possible model inside the chosen response family. For Gaussian linear models with fixed variance, this reduces to a constant multiple of the residual sum of squares.
The main reason to define deviance is to compare nested GLMs by asking whether the extra parameters substantially improve likelihood fit. The obstruction is that a smaller deviance always improves in-sample fit, even when the added variables are only fitting noise. To decide whether the decrease is larger than random sampling variation would typically produce, we need a reference distribution for the deviance difference under the smaller nested model.
[quotetheorem:4469]
[citeproof:4469]
Deviance comparisons are model comparisons, not automatic measures of predictive quality. Nestedness is needed because the saturated likelihood cancels and the difference becomes a likelihood-ratio statistic between a constrained and an unconstrained model; comparing unrelated models by this theorem has no $\chi^2$ justification. Regularity and independent restrictions are also essential: boundary constraints, aliased covariates, separation, or redundant restrictions can change the limiting distribution or make the degrees of freedom smaller than the number of written constraints. A smaller deviance means better in-sample likelihood fit, while predictive performance should be checked using held-out likelihood, calibration, and classification summaries when the scientific goal is prediction.
Generalized linear models extend regression to non-Gaussian responses by changing the conditional distribution and the link between predictors and means. After fitting and interpreting those models, the course closes by asking how to validate them on new data and communicate what they actually predict.
# 11. Prediction, Validation, and Communication
Prediction is the point at which regression leaves the fitted sample and becomes a claim about new cases. Chapters 2--10 developed estimation, inference, diagnostics, regularization, and generalized linear models; this final chapter asks how those tools should be used when the goal is to predict, validate, and report results without overstating what the data support. The main themes are honest assessment, uncertainty quantification, calibration, and reproducible communication.
## Honest Assessment Through Data Splitting
How do we estimate predictive performance when the same data can be used to choose variables, tune transformations, remove outliers, and fit the final model? The central difficulty is that in-sample fit measures the performance of a procedure on data it has already seen. Validation tries to mimic future use by separating model building from model assessment.
[definition: Training Validation And Test Split]
Let $D = \{(X_i,Y_i):1\le i\le n\}$ be a dataset. A training-validation-test split is a partition
\begin{align*}
D = D_{\mathrm{train}} \cup D_{\mathrm{val}} \cup D_{\mathrm{test}}
\end{align*}
into disjoint subsets, where $D_{\mathrm{train}}$ is used to fit candidate models, $D_{\mathrm{val}}$ is used to compare or tune candidates, and $D_{\mathrm{test}}$ is used once to estimate final out-of-sample performance.
[/definition]
The split is meaningful only relative to a modelling workflow. If the analyst looks at the test responses while choosing transformations, deleting observations, or selecting covariates, then the test set has become part of the training process.
This failure mode is important enough to name because it can occur even when the data have been physically split into separate files. The definition below isolates the broader problem: information that should be unavailable during model construction has entered the fitting or tuning procedure.
[definition: Data Leakage]
Data leakage occurs when information unavailable at prediction time, or information from the assessment sample, is used during model construction or tuning.
[/definition]
Leakage can be subtle. Standardising covariates using the full dataset before splitting, choosing a polynomial degree by looking at test error, or using post-treatment variables to predict a pre-treatment decision all contaminate assessment.
[example: Out Of Sample Housing Price Prediction]
Suppose $Y$ is sale price and $X$ contains floor area, postcode indicators, age, and number of bedrooms. Split the observed sales into disjoint sets $D_{\mathrm{train}}$, $D_{\mathrm{val}}$, and $D_{\mathrm{test}}$. The training set is used to estimate the regression and to learn preprocessing quantities. For example, if floor area is standardized, the training mean and standard deviation are
\begin{align*}
\bar x_{\mathrm{area,train}}
&= \frac{1}{|D_{\mathrm{train}}|}\sum_{i\in D_{\mathrm{train}}} x_{i,\mathrm{area}},\\
s_{\mathrm{area,train}}^2
&= \frac{1}{|D_{\mathrm{train}}|-1}\sum_{i\in D_{\mathrm{train}}}
\left(x_{i,\mathrm{area}}-\bar x_{\mathrm{area,train}}\right)^2,
\end{align*}
and every validation or test observation is transformed using these same two training-set quantities:
\begin{align*}
z_{i,\mathrm{area}}
= \frac{x_{i,\mathrm{area}}-\bar x_{\mathrm{area,train}}}{s_{\mathrm{area,train}}}.
\end{align*}
The validation set may choose between candidate rules such as a linear-price model, a log-price model, and a model with interactions. After that choice is fixed, the test set is used once to report held-out prediction error. If $\hat f$ denotes the fitted rule selected before looking at the test responses, then the test root mean squared error is
\begin{align*}
\mathrm{RMSE}_{\mathrm{test}}
&= \left\{
\frac{1}{|D_{\mathrm{test}}|}
\sum_{i\in D_{\mathrm{test}}}
\left(Y_i-\hat f(X_i)\right)^2
\right\}^{1/2}.
\end{align*}
For instance, if the three test residuals are $20000$, $-10000$, and $30000$ dollars, then
\begin{align*}
\mathrm{RMSE}_{\mathrm{test}}
&= \left\{\frac{1}{3}\left(20000^2+(-10000)^2+30000^2\right)\right\}^{1/2}\\
&= \left\{\frac{1}{3}\left(400000000+100000000+900000000\right)\right\}^{1/2}\\
&= \left\{\frac{1400000000}{3}\right\}^{1/2}\\
&\approx 21602.47.
\end{align*}
Thus the reported held-out error is about $21602$ dollars.
If postcode-level averages are computed using all sales before the split, then the feature for a test house in postcode $g$ uses
\begin{align*}
\bar Y_g
= \frac{1}{n_g}\sum_{j:\,\mathrm{postcode}_j=g} Y_j,
\end{align*}
which includes test responses whenever test houses appear in postcode $g$. The predictor has then used information from $D_{\mathrm{test}}$ during construction, so the RMSE is no longer an assessment of a rule built without the test responses. The resulting test error is typically too optimistic because part of the held-out outcome information has been encoded into the predictors themselves.
[/example]
The basic mathematical justification for a test set is that, after training is complete, the fitted rule is evaluated on observations that behave like fresh draws from the target population.
The important subtlety is that the fitted rule must be treated as fixed before the test outcomes are examined. Conditional on the training data, each test loss is then a fresh draw from the loss distribution of that fitted rule.
The formal point to justify is unbiasedness: once the training sample has fixed the fitted rule, averaging independent test losses should estimate that rule's risk. The potential failure is subtle: if the test outcomes influence model choice, preprocessing, or tuning, the losses are no longer losses of a fixed rule on fresh observations. We need a conditional independence statement after training has ended before interpreting test error as an estimate of risk.
[quotetheorem:4470]
[citeproof:4470]
The theorem explains why the assessment sample must be separated from model construction. Independence of the test set is essential: if the analyst chooses the model after inspecting test losses, the summands are no longer losses of a fixed fitted rule on fresh data. Integrability is also a real condition, since losses with undefined mean do not have an unbiased risk estimate in this sense. The result does not say the estimate has low variance; for example, a test set of ten observations can give a very noisy estimate even when it is unbiased.
When data are scarce, holding out one large test set can make both fitting and assessment unstable. Cross-validation addresses this by repeating the held-out evaluation across folds, so a precise definition is needed to specify which model is trained and which losses are averaged.
[definition: K Fold Cross Validation]
Let $D$ be partitioned into $K$ disjoint folds $D_1,\dots,D_K$. For each $k$, fit the algorithm on $D\setminus D_k$ to obtain $\hat f_{-k}$, and define the $K$-fold cross-validation estimate
\begin{align*}
\hat R_{\mathrm{CV}} = \frac{1}{n}\sum_{k=1}^K \sum_{(X_i,Y_i)\in D_k} L(Y_i,\hat f_{-k}(X_i)).
\end{align*}
[/definition]
Cross-validation uses the data more efficiently than a single validation split because each observation is used for assessment once and training $K-1$ times. Its target is the risk of the learning procedure trained on roughly $(K-1)n/K$ observations, not exactly the risk of the final model trained on all $n$ observations.
The forward question is whether those fold losses can still be interpreted as honest out-of-sample losses even though every observation helps train some of the other fitted models. This requires each assessed observation to be separated from the model that predicts it, and it requires all preprocessing and tuning that affect the prediction to be repeated within the appropriate training folds.
[quotetheorem:4471]
[citeproof:4471]
The i.i.d. assumption is doing more work than exchangeability alone: it makes the held-out observation independent of the model trained on the complementary folds. If observations are clustered, time ordered, or duplicated across folds, ordinary random cross-validation can report a loss that is too optimistic for future independent cases. The requirement that preprocessing occur inside each fold prevents leakage through centering, scaling, imputation, feature screening, or dimension reduction. When cross-validation is used to select among many models, the selected model cross-validation error can be optimistically biased. Nested cross-validation or a final test set is the standard remedy when the final performance estimate matters.
## Prediction Intervals, Mean Response Intervals, And Residual Uncertainty
What uncertainty should be reported when a fitted regression is used at a new covariate value $x_0$? There are two different questions. We may want uncertainty about the conditional mean $\mathbb E[Y\mid X=x_0]$, or we may want uncertainty about a future realised response $Y_0$ at $x_0$.
[definition: Mean Response Interval]
In a regression model with conditional mean $m(x)=\mathbb E[Y\mid X=x]$, a mean response interval at $x_0$ is an interval procedure intended to cover $m(x_0)$ with a stated probability or confidence level.
[/definition]
This interval concerns the regression surface itself, not the random scatter of future observations around that surface.
A different reporting problem arises when the user needs a range for an actual future outcome, such as the next sale price, test score, or default indicator. That range must include both uncertainty about the fitted mean and the residual variation of new observations around the mean.
[definition: Prediction Interval]
In a regression model, a prediction interval at $x_0$ is an interval procedure intended to cover a future response $Y_0$ generated at covariate value $x_0$ with a stated probability or confidence level.
[/definition]
The prediction interval is wider because it includes both uncertainty in the fitted mean and irreducible residual variation around the mean. Reporting a mean-response interval as if it were a prediction interval is a common way to understate predictive uncertainty.
To use these two interval types correctly, we need formulas that show exactly where the extra uncertainty enters. In the Gaussian linear model, the same fitted value and leverage term appear in both intervals, but prediction of a new response has an additional future error term that cannot be removed by estimating the mean more precisely.
[quotetheorem:4472]
[citeproof:4472]
Gaussianity gives the exact $t$ distribution used in the displayed intervals; without it, the same formulas are usually large-sample approximations rather than exact finite-sample procedures. Full column rank is needed so that $(X^\top X)^{-1}$ exists and the linear predictor variance is defined. Treating the design as fixed means the interval is conditional on the observed covariates, while the independent future error is what adds the extra $1$ inside the prediction interval. If the future case is drawn from a different population or has dependent noise, the stated coverage target no longer matches the prediction problem.
[example: Reporting A Treatment Coefficient With Uncertainty]
Suppose the fitted linear model is
\begin{align*}
Y_i=\beta_0+\tau T_i+Z_i^\top\gamma+\varepsilon_i,
\end{align*}
where $T_i=1$ denotes treatment and $T_i=0$ denotes control. Holding the included controls $Z$ fixed, changing $T$ from $0$ to $1$ changes the fitted conditional mean by
\begin{align*}
(\beta_0+\tau\cdot 1+Z^\top\gamma)-(\beta_0+\tau\cdot 0+Z^\top\gamma)
&=\beta_0+\tau+Z^\top\gamma-\beta_0-0-Z^\top\gamma\\
&=\tau.
\end{align*}
Thus the treatment coefficient estimates a conditional mean difference, not the uncertainty in one patient's future outcome.
For example, if $\hat\tau=4.2$, $\operatorname{SE}(\hat\tau)=1.1$, and the residual degrees of freedom are $196$, then using the $0.975$ quantile $t_{196,0.975}\approx 1.972$, the $95\%$ confidence interval for the treatment coefficient is
\begin{align*}
\hat\tau \pm t_{196,0.975}\operatorname{SE}(\hat\tau)
&=4.2\pm 1.972\cdot 1.1\\
&=4.2\pm 2.1692\\
&=[4.2-2.1692,\;4.2+2.1692]\\
&=[2.0308,\;6.3692].
\end{align*}
A reproducible table should therefore report at least $\hat\tau=4.2$, $\operatorname{SE}(\hat\tau)=1.1$, the interval $[2.03,6.37]$, the sample, and the fitted specification.
If the goal is instead to predict a future treated patient's outcome at covariate row $x_0$, the relevant center is the whole fitted linear predictor $x_0\hat\beta$, not only $\hat\tau$. For instance, if $x_0\hat\beta=72$, $\hat\sigma=9$, and $x_0(X^\top X)^{-1}x_0^\top=0.04$, then the Gaussian linear-model prediction interval has half-width
\begin{align*}
t_{196,0.975}\hat\sigma\sqrt{1+x_0(X^\top X)^{-1}x_0^\top}
&=1.972\cdot 9\sqrt{1+0.04}\\
&=1.972\cdot 9\sqrt{1.04}\\
&\approx 1.972\cdot 9\cdot 1.0198\\
&=1.972\cdot 9.1782\\
&\approx 18.10.
\end{align*}
So the individual prediction interval is
\begin{align*}
72\pm 18.10=[53.90,\;90.10].
\end{align*}
The coefficient interval describes uncertainty about an average conditional treatment contrast, while the prediction interval also includes residual variation around the fitted mean for an individual future response.
[/example]
The example also points to a boundary of the idea. The following remark, Residual Uncertainty Is Not A Nuisance, records that interpretation before the construction is used again.
[remark: Residual Uncertainty Is Not A Nuisance]
A small standard error for $x_0\hat\beta$ does not imply accurate individual prediction. If the residual variance is large, future observations vary substantially around the fitted mean even when the mean has been estimated precisely.
[/remark]
## Calibration Of Predicted Probabilities
When a regression model outputs probabilities, how do we know whether those probabilities mean what they say? Discrimination asks whether high-risk observations tend to have larger predicted probabilities than low-risk observations. Calibration asks whether events assigned probability near $q$ occur about a fraction $q$ of the time.
[definition: Calibration Function]
Let $X:(\Omega,\mathcal F)\to(E,\mathcal E)$ be a random covariate, let $Y:\Omega\to\{0,1\}$ be a binary response, and let $\hat p:E\to[0,1]$ be a measurable predicted probability map. Let $S=\operatorname{supp}(\hat p(X))$. The calibration function is the map
\begin{align*}
c:S&\to[0,1], & q&\mapsto \mathbb E[Y\mid \hat p(X)=q]
\end{align*}
for values $q\in S$ where the conditional expectation is defined.
[/definition]
A perfectly calibrated probability forecast has $c(q)=q$. In practice, calibration is estimated by binning predictions or by smoothing observed outcomes against predicted probabilities. A common failure mode is a model that ranks cases well but systematically exaggerates probabilities, such as assigning risks near $0.8$ to a group whose observed event rate is closer to $0.5$.
Because the calibration function is a conditional expectation rather than a directly observed object, analysts need a diagnostic that compares predicted probabilities with observed frequencies on validation data. A reliability curve provides that empirical view and makes systematic overprediction or underprediction visible across the probability scale.
[definition: Reliability Curve]
A reliability curve is a graphical summary that plots estimated observed event frequencies against predicted probabilities, usually after grouping predictions into bins or applying a smoother.
[/definition]
A reliability curve is descriptive, but calibration also has a precise probabilistic meaning. If a model assigns probability $q$ to many comparable cases, the observed event frequency among those cases should be $q$ in expectation. The theorem below connects that visual diagnostic to the conditional-expectation definition of perfect calibration.
[quotetheorem:4473]
[citeproof:4473]
[example: Logistic Probability Calibration Plot]
Fit a logistic regression predicting loan default from income, loan amount, credit history, and employment status. For each validation observation $i$, let $\hat p_i$ be the fitted probability and let $Y_i\in\{0,1\}$ record whether the loan defaulted. Divide the validation predictions into ten bins $B_1,\dots,B_{10}$, for example
\begin{align*}
B_1&=\{i:0\le \hat p_i<0.1\},\\
B_2&=\{i:0.1\le \hat p_i<0.2\},\\
&\ \vdots\\
B_{10}&=\{i:0.9\le \hat p_i\le 1\}.
\end{align*}
For any nonempty bin $B_b$, compute the horizontal coordinate as the average predicted probability
\begin{align*}
\bar p_b
= \frac{1}{|B_b|}\sum_{i\in B_b}\hat p_i
\end{align*}
and the vertical coordinate as the observed default fraction
\begin{align*}
\bar y_b
= \frac{1}{|B_b|}\sum_{i\in B_b}Y_i.
\end{align*}
For instance, suppose the high-risk bin $B_8$ contains five validation loans with predicted probabilities
\begin{align*}
0.72,\quad 0.75,\quad 0.78,\quad 0.81,\quad 0.84
\end{align*}
and default indicators
\begin{align*}
1,\quad 0,\quad 1,\quad 0,\quad 0.
\end{align*}
Then
\begin{align*}
\bar p_8
&=\frac{1}{5}(0.72+0.75+0.78+0.81+0.84)\\
&=\frac{3.90}{5}\\
&=0.78,
\end{align*}
while
\begin{align*}
\bar y_8
&=\frac{1}{5}(1+0+1+0+0)\\
&=\frac{2}{5}\\
&=0.40.
\end{align*}
The plotted point for this bin is therefore $(0.78,0.40)$. Since this point lies below the diagonal line $y=x$, the observed default frequency in this bin is smaller than the average predicted probability, so the model is overpredicting default risk in that range. Conversely, if a low-risk bin had $\bar p_b=0.08$ and $\bar y_b=0.15$, its point would lie above the diagonal because $0.15>0.08$, indicating underprediction among cases labelled low-risk.
[/example]
Calibration does not require the model to be linear in the covariates, and a model can be well calibrated while having weak discrimination. Conversely, a model can rank observations well while producing probabilities that are systematically too large or too small.
## Reproducible Tables And Graphical Summaries
What should be included in a regression report so that another reader can understand, check, and reproduce the analysis? A regression table is not just a container for coefficients; it is a record of the estimand, sample, model specification, uncertainty calculation, and modelling choices.
[definition: Reproducible Regression Table]
A reproducible regression table is a table that reports, for each fitted specification, the outcome, covariates, sample restrictions, coefficient estimates, uncertainty measures, sample size, and any tuning or variance-estimation choices needed to reconstruct the analysis.
[/definition]
Good tables separate estimates from interpretation. They state whether standard errors are classical, heteroskedasticity-robust, clustered, bootstrap-based, or derived from a likelihood approximation. They also identify the unit of observation and the scale of the outcome.
[example: Regression Table For A Policy Analysis]
Suppose the outcome is annual earnings in dollars and the fitted policy regression is
\begin{align*}
Y_i
=
\beta_0+\tau T_i+Z_i^\top\gamma+\delta_{r(i)}+\varepsilon_i,
\end{align*}
where $T_i=1$ if person $i$ participated in the training programme, $T_i=0$ otherwise, $Z_i$ contains demographic controls, and $\delta_{r(i)}$ is the fixed effect for the region $r(i)$. A reproducible table should state that the sample is, for example, working-age adults, report the number of observations $n$, list the included controls and fixed effects, and identify the standard errors as heteroskedasticity-robust or clustered.
Holding $Z_i$ and region fixed, the fitted mean for a treated person is
\begin{align*}
\hat m(1,Z_i,r(i))
&=
\hat\beta_0+\hat\tau\cdot 1+Z_i^\top\hat\gamma+\hat\delta_{r(i)}\\
&=
\hat\beta_0+\hat\tau+Z_i^\top\hat\gamma+\hat\delta_{r(i)}.
\end{align*}
For an otherwise identical untreated person, the fitted mean is
\begin{align*}
\hat m(0,Z_i,r(i))
&=
\hat\beta_0+\hat\tau\cdot 0+Z_i^\top\hat\gamma+\hat\delta_{r(i)}\\
&=
\hat\beta_0+Z_i^\top\hat\gamma+\hat\delta_{r(i)}.
\end{align*}
Therefore the fitted conditional mean difference is
\begin{align*}
\hat m(1,Z_i,r(i))-\hat m(0,Z_i,r(i))
&=
\left(\hat\beta_0+\hat\tau+Z_i^\top\hat\gamma+\hat\delta_{r(i)}\right)
-
\left(\hat\beta_0+Z_i^\top\hat\gamma+\hat\delta_{r(i)}\right)\\
&=
\hat\beta_0+\hat\tau+Z_i^\top\hat\gamma+\hat\delta_{r(i)}
-\hat\beta_0-Z_i^\top\hat\gamma-\hat\delta_{r(i)}\\
&=
\hat\tau.
\end{align*}
Thus the treatment coefficient should be described as the estimated conditional mean earnings difference between participants and nonparticipants under this fitted specification, not as an unconditional earnings gain for every worker.
For instance, a table row might report
\begin{align*}
\hat\tau=1850,
\qquad
\operatorname{SE}_{\mathrm{robust}}(\hat\tau)=620,
\qquad
n=2400.
\end{align*}
Using the normal critical value $1.96$ for an approximate $95\%$ interval, the reported uncertainty interval is
\begin{align*}
\hat\tau\pm 1.96\,\operatorname{SE}_{\mathrm{robust}}(\hat\tau)
&=
1850\pm 1.96\cdot 620\\
&=
1850\pm 1215.2\\
&=
[1850-1215.2,\;1850+1215.2]\\
&=
[634.8,\;3065.2].
\end{align*}
A sensitivity table can then show how this row changes when controls are removed, region fixed effects are added, or standard errors are clustered; the comparison is meaningful because each column states the outcome scale, sample restriction, fitted specification, uncertainty method, and sample size.
[/example]
Graphical summaries complement tables by showing patterns that numbers hide. Residual plots, coefficient plots, partial residual plots, calibration plots, and prediction interval displays each answer a different diagnostic or communication question.
[definition: Coefficient Plot]
A coefficient plot displays point estimates and uncertainty intervals for selected regression coefficients, often across several specifications or subgroups.
[/definition]
Coefficient plots are most useful for comparing estimated associations or contrasts; prediction displays instead put fitted outcomes and predictive uncertainty on the response scale.
When the audience needs to act on predictions, coefficient-scale summaries can hide the quantities that matter: predicted probabilities, expected outcomes, interval width, residual spread, and the comparison with observed data. A prediction summary plot is used to put those pieces on the same response scale while keeping the inferential target explicit.
[definition: Prediction Summary Plot]
A prediction summary plot displays fitted values or predicted probabilities together with uncertainty bands, prediction intervals, residuals, or observed outcomes, depending on the inferential target.
[/definition]
A plot of confidence bands for the mean response should be labelled differently from a plot of prediction intervals for future observations. The visual distinction matters because the latter communicates the variability a user should expect when making an actual prediction.
[remark: Communication Checklist]
A regression report should state the prediction target, the data split or validation method, the loss metric, the fitted specification, the uncertainty method, and the limits of transportability. For probability forecasts, it should report both discrimination when relevant and calibration when probabilities are used for decisions.
[/remark]
## From Model Output To Responsible Claims
How should the analyst decide what claim the regression supports? The strongest defensible statement is usually narrower than the most impressive statement suggested by a fitted model. Prediction accuracy, causal interpretation, and coefficient significance answer different questions.
[explanation: Matching Claims To Evidence]
A low cross-validation error supports a claim about predictive performance for observations drawn from a population similar to the validation data. It does not by itself support a causal claim about interventions. A small $p$-value for a coefficient supports evidence against a null restriction inside the fitted model, but it does not certify that the fitted model predicts well.
Calibration supports the interpretation of predicted probabilities as frequencies in comparable cases. It does not guarantee good ranking, fairness across subgroups, or stability under distribution shift. For high-stakes applications, validation should be repeated across meaningful subgroups and time periods.
[/explanation]
The course began with regression as prediction and projection. It ends by returning to that viewpoint: a regression model is useful only insofar as its assumptions, validation design, uncertainty statements, and communication match the decision or scientific question at hand.
Contents
- Introduction
- What Regression Tries to Estimate
- Why Linear Models Enter
- From Population Targets to Data
- Geometry, Probability, and Inference
- Roadmap of the Course
- 1. Regression as Prediction and Projection
- Prediction With Squared Loss
- Simple Linear Regression As A Population Problem
- Regression To The Mean
- Orthogonal Projection In $L^2$
- Linearity Restrictions And Misspecification
- 2. Ordinary Least Squares in Matrix Form
- Encoding Linear Models with a Design Matrix
- Least Squares and the Normal Equations
- Projection, Hat Matrix, and Leverage
- Orthogonality and Sums of Squares
- 3. Probabilistic Assumptions and Estimation Properties
- Fixed and Random Designs
- Exogeneity and Error Assumptions
- Unbiasedness of Ordinary Least Squares
- Variance and Covariance of the OLS Estimator
- Estimating the Error Variance
- Consistency Under Random Design
- What The Assumptions Buy
- 4. The Gauss-Markov Theorem
- Linear Unbiased Estimators and BLUE
- The Gauss-Markov Theorem
- Equality and Uniqueness in the BLUE Comparison
- Consequences and Limitations
- Known Heteroskedasticity and Weighted Least Squares
- 5. Exact Inference Under Normal Errors
- Gaussian Least Squares Distributions
- Tests and Intervals for Individual Coefficients
- Mean Response and Prediction Intervals
- F-Tests for Nested Models and Linear Restrictions
- Interpreting Exact Inference
- 6. Large-Sample Inference and Robust Standard Errors
- Asymptotic Normality Under Random Design
- Heteroskedasticity-Consistent Covariance Estimation
- Clustered Standard Errors
- Wald Inference With Robust Covariance Matrices
- 7. Model Building and Interpretation
- Scaling, Centering, and Transformed Variables
- Causal and Predictive Interpretation
- Model Selection Criteria and Prediction Error
- 8. Diagnostics and Influence
- Residual Plots and Model Failure
- Leverage and Deleted Residuals
- Influence Measures
- Multicollinearity and Unstable Coefficients
- Diagnostic Workflow
- 9. Beyond OLS: Regularization and High-Dimensional Regression
- Ridge Regression and Shrinkage Along Principal Components
- Lasso Regression and Sparsity
- Bias-Variance Tradeoff and Tuning by Cross-Validation
- Post-Selection Caution in High-Dimensional Regression
- 10. Generalized Linear Models and Logistic Regression
- Modelling Means Through Exponential Families
- Logistic Regression for Binary Responses
- Iteratively Reweighted Least Squares
- Large-Sample Inference and Deviance
- 11. Prediction, Validation, and Communication
- Honest Assessment Through Data Splitting
- Prediction Intervals, Mean Response Intervals, And Residual Uncertainty
- Calibration Of Predicted Probabilities
- Reproducible Tables And Graphical Summaries
- From Model Output To Responsible Claims
Regression and Linear Models
Content
Problems
History
Created by admin on 5/31/2026 | Last updated on 6/1/2026
Prerequisites (0/4 completed)
Log in to track your prerequisite progress.
Prerequisites Graph
Interactive dependency map showing prerequisite concepts
Loading dependency graph...
Theorem
Definition
Current
Requires
Rate this page
★
★
★
★
★
Poor
Excellent