Multivariate Statistical Analysis studies how to model, estimate, and draw conclusions from data with several variables observed at once. Instead of treating each measurement separately, the course focuses on the joint structure of a random vector: how its components vary together, how dependence is encoded, and how that dependence can be exploited for inference, dimension reduction, classification, and prediction. The central language is multivariate probability, with the multivariate normal distribution serving as the foundational model from which many classical procedures are developed.
The chapters build in a natural progression from probabilistic foundations to statistical methods. After introducing the multivariate normal distribution and random sampling from it, the course develops the Wishart distribution and uses it to study covariance estimation and Hotelling’s $T^2$ tests for multivariate means. It then moves to exploratory and structural methods such as principal component analysis, factor analysis, canonical correlation analysis, and linear discriminant analysis, followed by multivariate analysis of variance and the multivariate linear model with generalised least squares. The later chapters extend the classical theory to high-dimensional settings and broaden the model class beyond normality through copulas and other dependence constructions.
Overall, the course connects theory and application: classical matrix-based inference, geometric interpretations of multivariate data, and modern approaches for complex dependence structures. The goal is to give a coherent framework for understanding multivariate data analysis, from exact distributional results in the normal model to contemporary methods that remain effective when dimensionality is high or Gaussian assumptions are too restrictive.
# Introduction
These notes study statistical models in which each observation carries several coordinates at once. Instead of treating the coordinates separately, multivariate analysis asks how their joint variation, dependence, and geometry determine inference. The course begins with the multivariate normal distribution, builds its sampling theory through Wishart and Hotelling distributions, and then develops the classical dimension-reduction and classification methods that arise from covariance structure.
The guiding theme is that many multivariate procedures are linear algebra applied to random vectors. Means become points in $\mathbb R^p$, covariance matrices become positive semidefinite linear maps, and statistical questions become questions about projections, eigenspaces, ranks, and quadratic forms. This introductory chapter fixes the objects, notation, and recurring questions that will appear throughout the course.
# 1. Foundations of Multivariate Data
This chapter sets up the language used throughout the course: random vectors, covariance matrices, multivariate samples, and the normal model. The main aim is to replace coordinate-by-coordinate thinking with geometric statements about directions, quadratic forms, and linear transformations. Chapters 2-5 specialise these objects into exact normal, Wishart, and Hotelling distributions, while Chapters 6-11 turn them into statistical procedures.
## What Changes in Several Dimensions?
A univariate sample has one direction in which it can vary. A multivariate sample has many directions, and the choice of direction is part of the statistical problem. If $X=(X_1,\dots,X_p)$ is a random vector, then the coordinate variables $X_i$ are only one possible set of one-dimensional summaries; every linear combination $a^\top X$ is another.
[definition: Random Vector]
Let $(\Omega,\mathcal F,\mathbb P)$ be a probability space. A random vector in $\mathbb R^p$ is a measurable map $X:(\Omega,\mathcal F)\to (\mathbb R^p,\mathcal B(\mathbb R^p))$.
[/definition]
The measurability condition means that probabilities of events such as $X\in A$ are defined for all Borel sets $A\subset \mathbb R^p$. Once $X$ is regarded as a map into $\mathbb R^p$, its distribution is a probability measure on Euclidean space rather than a list of $p$ separate marginal distributions.
To compare such distributions, we first need a vector-valued notion of location that records the average position of the whole random point, not just isolated coordinate summaries. A coordinate-by-coordinate list of expectations is useful only after it is packaged as an object that transforms like the random vector itself. This packaging lets later formulas express centering, covariance, and normal densities without choosing a special coordinate as primary.
[definition: Mean Vector]
Let $X=(X_1,\dots,X_p)$ be a random vector with $\mathbb E[|X_i|]<\infty$ for each $i$. The mean vector of $X$ is
\begin{align*}
\mu=\mathbb E[X]:=(\mathbb E[X_1],\dots,\mathbb E[X_p])^\top\in\mathbb R^p.
\end{align*}
[/definition]
The mean vector locates the centre of the distribution, but it does not describe orientation, spread, or dependence. Two bivariate distributions can have the same mean while varying along very different directions.
The next parameter must measure variation in every linear direction at once. Treating the coordinates separately would record only marginal variances and would miss directional dependence, so the second-order object is the covariance matrix.
[definition: Covariance Matrix]
Let $X$ be a random vector in $\mathbb R^p$ with $\mathbb E[|X_i|^2]<\infty$ for each $i$, and let $\mu=\mathbb E[X]$. The covariance matrix of $X$ is
\begin{align*}
\Sigma=\operatorname{Cov}(X):=\mathbb E[(X-\mu)(X-\mu)^\top]\in\mathbb R^{p\times p}.
\end{align*}
[/definition]
Its entries are $\Sigma_{ij}=\operatorname{Cov}(X_i,X_j)$. The diagonal entries are variances, while the off-diagonal entries measure pairwise linear dependence between coordinates.
Before using covariance matrices as parameters, we need to know what structural restrictions they always satisfy. Directional variance gives the key test: for any vector $a$, the scalar random variable $a^\top X$ has variance $a^\top\Sigma a$, so this quadratic form cannot be negative. The theorem records this intrinsic symmetry and non-negativity, and it explains why covariance geometry is governed by positive semidefinite matrices rather than arbitrary square matrices.
[quotetheorem:3999]
[citeproof:3999]
This theorem explains why positive semidefinite matrices are the natural parameter space for covariance matrices. The finite second-moment hypothesis is essential: for a heavy-tailed scalar variable such as a Cauchy random variable, the covariance is not defined, so no finite matrix can encode second-order spread. Symmetry and positive semidefiniteness are necessary conditions only; the theorem does not say that a covariance matrix determines the whole distribution, since many non-normal distributions share the same mean and covariance. The forward connection is that the normal likelihood in Chapter 2, the Wishart distribution in Chapter 4, and the root decompositions in MANOVA all rely on this positive semidefinite geometry. In particular, the variance in direction $a$ is the quadratic form $a^\top\Sigma a$.
[example: Two-Dimensional Covariance Geometry]
Let $X=(X_1,X_2)$ have mean $0$ and covariance matrix
\begin{align*}
\Sigma=\begin{pmatrix}4&1\\1&1\end{pmatrix}.
\end{align*}
For a unit vector $a=(\cos\theta,\sin\theta)$, the variance of $a^\top X$ is $a^\top\Sigma a$. Diagonalising $\Sigma$ gives two orthogonal directions of maximal and minimal variance, so the covariance matrix determines an ellipse of typical spread whose axes are eigenvectors of $\Sigma$.
[/example]
## The Statistical Model Behind the Course
What should be assumed about repeated multivariate measurements? The basic setting is an independent sample $X_1,\dots,X_n$ from a common distribution on $\mathbb R^p$. The aim is to infer population quantities such as a mean vector, covariance matrix, regression relation, or group separation rule from the sample.
[definition: Multivariate Sample]
A multivariate sample of size $n$ and dimension $p$ is a sequence $X_1,\dots,X_n$ of random vectors $X_i:(\Omega,\mathcal F)\to(\mathbb R^p,\mathcal B(\mathbb R^p))$.
[/definition]
In most of the course the vectors are assumed independent and identically distributed. The data can then be arranged as an $n\times p$ matrix $X$, whose $i$-th row is the observation $X_i^\top$.
Once observations are organised as repeated vectors, the first task is to compress their common location without separating the coordinates into unrelated univariate problems. The empirical centre should remain a vector in the same space as each observation, because later residuals and covariance estimates are formed by subtracting this centre from every data point.
This requires a coordinate-free summary of location that averages whole observation vectors at once, rather than averaging each variable as an isolated scalar. The empirical centre must be stable under changes of coordinates and must live in the same vector space as the observations.
[definition: Sample Mean Vector]
Let $X_1,\dots,X_n$ be random vectors in $\mathbb R^p$. The sample mean vector is
\begin{align*}
\bar X=\frac{1}{n}\sum_{i=1}^n X_i.
\end{align*}
[/definition]
A mean alone cannot show whether the observed cloud is elongated, tilted, or nearly confined to a lower-dimensional subspace. To estimate the population covariance, we need a matrix built from centred residual vectors rather than from raw coordinates.
[definition: Sample Covariance Matrix]
Let $X_1,\dots,X_n$ be random vectors in $\mathbb R^p$, and let $\bar X$ be their sample mean vector. The sample covariance matrix is
\begin{align*}
S=\frac{1}{n-1}\sum_{i=1}^n (X_i-\bar X)(X_i-\bar X)^\top.
\end{align*}
[/definition]
The factor $n-1$ is the multivariate version of the usual degrees-of-freedom correction. The matrix $S$ records the empirical spread in every direction, since $a^\top S a$ is the sample variance of the scalar observations $a^\top X_1,\dots,a^\top X_n$.
Before using $S$ as an estimator of $\Sigma$, we need to check that this correction has the intended target in the multivariate setting. The relevant question is whether the matrix expectation of $S$ equals the population covariance, not merely whether each entry resembles a familiar univariate variance estimate.
[quotetheorem:4000]
[citeproof:4000]
This result is the entry point to sampling theory. It says that the empirical covariance is centred correctly at the population covariance, so the matrix $S$ is not systematically too large or too small once the degrees-of-freedom correction is included. The conclusion is about expectation, not about a guarantee in one realised sample: a small sample can still give a noisy or nearly singular covariance estimate. Later chapters refine it under normality, where the distribution of $(n-1)S$ becomes Wishart and separates from the distribution of $\bar X$.
## Why the Multivariate Normal Distribution Is Central
Which probability model is tractable enough to support exact inference and rich enough to encode dependence? The multivariate normal distribution plays this role because it is stable under affine transformations, has explicit marginal and conditional distributions, and converts many quadratic forms into chi-squared laws.
[definition: Multivariate Normal Distribution]
Let $\mu\in\mathbb R^p$ and let $\Sigma\in\mathbb R^{p\times p}$ be symmetric positive semidefinite. A random vector $X$ in $\mathbb R^p$ has multivariate normal distribution with mean vector $\mu$ and covariance matrix $\Sigma$ if every linear combination $a^\top X$, with $a\in\mathbb R^p$, is a univariate normal random variable with mean $a^\top\mu$ and variance $a^\top\Sigma a$. We write $X\sim\mathcal N_p(\mu,\Sigma)$.
[/definition]
This definition includes singular covariance matrices. When $\Sigma$ is positive definite, the distribution also has a density with ellipsoidal contours centred at $\mu$.
The defining property is expressed through all linear projections, so the first structural question is whether those projections remain normal after applying a matrix transformation. This matters because most later statistics are built by centring, rotating, selecting coordinates, or projecting residuals. The theorem gives the closure rule that allows these operations to be treated as changes of coordinates inside the same normal family.
[quotetheorem:4001]
[citeproof:4001]
Affine stability explains why matrix methods fit naturally with normal theory. The normality hypothesis is essential: a non-normal random vector may have finite mean and covariance, but a projection or residual can have a distribution outside any tractable named family. The theorem does not say that arbitrary nonlinear transformations preserve normality; for example, if $Z\sim\mathcal N(0,1)$, then $Z^2$ is not normal. Its role in the course is to justify reducing multivariate normal calculations to projections, rotations, coordinate selections, standardisations, and residual-forming maps without leaving the normal family.
[example: Standardising a Normal Vector]
Let $X\sim\mathcal N_p(\mu,\Sigma)$ with $\Sigma$ positive definite. If $\Sigma=Q\Lambda Q^\top$ is an eigendecomposition with $Q$ orthogonal and $\Lambda$ diagonal with positive entries, define $Y=\Lambda^{-1/2}Q^\top(X-\mu)$. Then $Y\sim\mathcal N_p(0,I_p)$, so the transformation rotates to principal axes and rescales each coordinate to unit variance.
[/example]
## The Four Classical Techniques
How do we use covariance structure to simplify, interpret, and classify multivariate data? Looking only at the original coordinates can be misleading because the strongest signal may lie in a rotated direction, not in any single measured variable. The middle part of the course studies four classical methods: principal component analysis, factor analysis, canonical correlation analysis, and linear discriminant analysis. Each method asks for a linear representation that optimises a statistical criterion.
[definition: Principal Component Direction]
Let $X$ be a random vector in $\mathbb R^p$ with covariance matrix $\Sigma$. A first principal component direction is a vector $a_1\in\mathbb R^p$ satisfying $|a_1|=1$ and
\begin{align*}
a_1^\top\Sigma a_1=\max_{|a|=1} a^\top\Sigma a.
\end{align*}
[/definition]
The first principal component direction finds the one-dimensional projection with maximal variance. Further principal components add orthogonality constraints and are obtained from the eigenspaces of $\Sigma$.
A different dimension-reduction problem asks for hidden variables that explain the dependence among many measured coordinates. This requires a model that separates shared latent variation from coordinate-specific noise.
[definition: Factor Model]
A $k$-factor model for a random vector $X\in\mathbb R^p$ is a representation
\begin{align*}
X=\mu+\Lambda F+\varepsilon,
\end{align*}
where $\mu\in\mathbb R^p$, $\Lambda\in\mathbb R^{p\times k}$, $F\in\mathbb R^k$ is a latent random vector, and $\varepsilon\in\mathbb R^p$ is an error random vector satisfying
\begin{align*}
\mathbb E[F]&=0, & \operatorname{Cov}(F)&=I_k, & \mathbb E[\varepsilon]&=0, & \operatorname{Cov}(F,\varepsilon)&=0,
\end{align*}
and
\begin{align*}
\operatorname{Cov}(\varepsilon)=\Psi,
\end{align*}
where $\Psi$ is diagonal with nonnegative entries.
[/definition]
Factor analysis starts from the idea that many observed coordinates may be driven by a smaller number of latent variables. Unlike PCA, which is primarily a variance decomposition, factor analysis imposes a probabilistic model with latent factors and coordinate-specific errors.
When the variables arrive in two natural blocks, the problem is no longer to explain one covariance matrix internally. We instead need a measure of how strongly some linear summary of one block can be associated with a linear summary of the other.
[definition: Canonical Correlation]
Let $X\in\mathbb R^p$ and $Y\in\mathbb R^q$ be random vectors with finite second moments. A first canonical correlation is the maximum correlation between scalar projections $a^\top X$ and $b^\top Y$ over nonzero $a\in\mathbb R^p$ and $b\in\mathbb R^q$ for which the variances are positive.
[/definition]
Canonical correlation analysis studies dependence between two blocks of variables rather than variation within one block. It turns cross-covariance information into paired directions that best align the two data views.
For classification, the objective changes from describing variation to assigning a new observation to a group. To make this objective mathematically explicit, we need a rule that converts a feature vector into a class label by comparing group scores. Restricting the scores to be linear gives the simplest decision mechanism compatible with covariance-based Gaussian modelling.
[definition: Linear Discriminant Rule]
Let $G$ be a class label taking values in $\{1,\dots,K\}$, and let $X\in\mathbb R^p$ be a feature vector. A linear discriminant rule is a map
\begin{align*}
d:\mathbb R^p\to\{1,\dots,K\}
\end{align*}
for which there exist vectors $a_k\in\mathbb R^p$ and constants $c_k\in\mathbb R$ such that
\begin{align*}
\delta_k(x)=a_k^\top x+c_k,
\end{align*}
and $d(x)$ is any class index attaining $\max_{1\le k\le K}\delta_k(x)$, with a specified tie-breaking convention if the maximum is not unique.
[/definition]
Linear discriminant analysis derives these scores from Gaussian class-conditional models with a common covariance matrix. The resulting decision boundaries are hyperplanes, so classification again becomes a problem in covariance geometry.
## MANOVA as a Unifying Framework
What replaces the two-sample $t$-test or one-way ANOVA when the response is vector-valued? MANOVA compares mean vectors across groups while accounting for covariance among response coordinates. Its test statistics are built from two positive semidefinite matrices: one measuring variation between fitted group means, and one measuring residual variation within groups.
[definition: Multivariate Linear Model]
A multivariate linear model has the matrix form
\begin{align*}
Y=XB+E,
\end{align*}
where $Y\in\mathbb R^{n\times p}$ is the response matrix, $X\in\mathbb R^{n\times r}$ is the design matrix, $B\in\mathbb R^{r\times p}$ is the coefficient matrix, and $E\in\mathbb R^{n\times p}$ is the error matrix.
[/definition]
This model includes one-sample mean testing, two-sample comparison, one-way MANOVA, and multivariate regression. Hypotheses become linear constraints on $B$.
To turn those constraints into test statistics, we must separate variation explained by the tested hypothesis from residual variation left after fitting the full model. In the multivariate setting both quantities are matrices because the response coordinates can vary together. The formal definitions below identify these two sources of variation from the restricted and unrestricted fitted values, giving the matrix pair on which MANOVA statistics operate.
[definition: Hypothesis And Error Matrices]
Consider the multivariate linear model $Y=XB+E_0$ with full fitted value matrix $\hat Y$ and residual matrix $R=Y-\hat Y$. For a nested null hypothesis whose fitted value matrix is $\hat Y_0$, the hypothesis and error matrices are
\begin{align*}
H&=(\hat Y-\hat Y_0)^\top(\hat Y-\hat Y_0), & E&=R^\top R.
\end{align*}
[/definition]
The same matrix pair $(H,E)$ leads to Wilks lambda, Pillai trace, Hotelling-Lawley trace, and Roy largest root. Defining $H$ and $E$ through fitted and residual matrices matters because scalar sums of squares cannot remember how different response coordinates co-vary. The null model is essential: without specifying the restricted fitted values $\hat Y_0$, there is no well-defined notion of variation attributable to the tested hypothesis. These statistics differ in how they combine the eigenvalues, but each measures the size of hypothesis variation relative to error variation.
## Recurring Mathematical Tools
Which background results will be used repeatedly? The course relies on spectral decompositions, singular value decompositions, Schur complements, quadratic forms, moment generating functions, and likelihood calculations. The goal is not to collect isolated formulae, but to see how a small set of tools controls many multivariate procedures.
[definition: Positive Definite Matrix]
A symmetric matrix $A\in\mathbb R^{p\times p}$ is positive definite if
\begin{align*}
x^\top A x>0
\end{align*}
for every nonzero $x\in\mathbb R^p$.
[/definition]
Positive definite covariance matrices describe distributions with nonzero variance in every direction. Positive semidefinite covariance matrices allow exact linear constraints and lead to the singular normal case.
[definition: Schur Complement]
Let
\begin{align*}
M=\begin{pmatrix}A&B\\C&D\end{pmatrix}
\end{align*}
be a block matrix with $D$ invertible. The Schur complement of $D$ in $M$ is
\begin{align*}
M/D:=A-BD^{-1}C.
\end{align*}
[/definition]
Schur complements will appear in conditional normal distributions and block matrix inversions. They express the residual covariance left in one block after linearly predicting it from another block.
The statistical reason for introducing this block operation is that best linear prediction has an error covariance with exactly the same form. Establishing that connection now explains why Schur complements later appear inside conditional covariance formulae.
[quotetheorem:4002]
[citeproof:4002]
This theorem foreshadows the conditional covariance formula for the multivariate normal distribution. The positive definiteness of $\operatorname{Cov}(Y)$ is needed for the displayed inverse; if one coordinate of $Y$ is an exact linear combination of the others, the normal equations need not have a unique coefficient matrix. The theorem does not claim that the best predictor among all measurable functions is linear, and outside the Gaussian setting nonlinear predictors can improve on it. In the Gaussian case, the best linear predictor is not merely a least-squares approximation; it is the conditional expectation, which is why Schur complements reappear in conditional normal theory.
## Reading the Course as a Whole
The first third of the course develops exact distribution theory. The multivariate normal distribution supplies the model, the Wishart distribution describes sample covariance matrices, and Hotelling $T^2$ gives the multivariate analogue of Student $t$ statistic.
The second third turns the distribution theory into methods for structure discovery. PCA extracts dominant variance directions, factor analysis proposes latent causes, and canonical correlation analysis compares two coordinate blocks.
The final third studies classification and grouped responses. Linear discriminant analysis gives Gaussian classification rules, and MANOVA organises multivariate hypothesis testing through matrix decompositions.
Throughout, the same question recurs: which features of a high-dimensional random vector are invariant under a change of coordinates, and which features depend on the chosen measurement scale? Keeping this distinction in view makes the formal calculations easier to interpret and prevents the methods from becoming a catalogue of unrelated matrix formulae.
Having separated what is invariant from what depends on scale, we are ready to choose a concrete model where those invariances can be exploited rather than ignored. The multivariate normal distribution provides that model: it keeps the algebra manageable while still capturing means, covariances, and linear transformations in a unified way.
# 2. The Multivariate Normal Distribution
A central problem in multivariate statistics is to model several real-valued measurements at once while retaining enough algebraic structure to compute marginals, conditionals, and transformations. The multivariate normal distribution is the matrix-valued extension of the one-dimensional normal law: it is stable under affine maps, has tractable conditional distributions, and converts covariance calculations into independence statements. This chapter develops the distribution from its density and characteristic function, then derives the formulas that make it useful in regression, quadratic forms, Wishart distributions, sample covariance matrices, and later inference.
## Describing Joint Gaussian Variation
How should a probability law on $\mathbb R^p$ encode both the centre of a cloud of observations and the directions in which the cloud varies most? The answer is a mean vector $\mu$ and a covariance matrix $\Sigma$, but the density formula only works directly when $\Sigma$ is invertible.
[definition: Positive Semidefinite Covariance Matrix]
A matrix $\Sigma \in \mathbb R^{p \times p}$ is a positive semidefinite covariance matrix if $\Sigma = \Sigma^\top$ and
\begin{align*}
a^\top \Sigma a \ge 0
\end{align*}
for every $a \in \mathbb R^p$.
[/definition]
The inequality says that every linear combination $a^\top X$ has non-negative variance when $\Sigma$ is used as a covariance matrix. If the inequality is strict for every non-zero $a$, then $\Sigma$ is positive definite.
The positive definite case is the one in which the Gaussian law spreads through all of $\mathbb R^p$. This is the setting where a single density can describe probabilities by integrating over ordinary regions of Euclidean space. The determinant and inverse of the covariance matrix then have concrete roles: the determinant normalises volume, while the inverse measures covariance-scaled distance from the mean.
To use this model in likelihoods, confidence regions, and exact sampling distributions, we need a precise density with its normalising constant and quadratic exponent. The non-singular multivariate normal distribution supplies that reference law.
[definition: Non-Singular Multivariate Normal Distribution]
Let $p \in \mathbb N$, let $\mu \in \mathbb R^p$, and let $\Sigma \in \mathbb R^{p \times p}$ be positive definite. A random vector $X: \Omega \to \mathbb R^p$ has the non-singular multivariate normal distribution with mean $\mu$ and covariance matrix $\Sigma$, written $X \sim \mathcal N_p(\mu, \Sigma)$, if its density is the function $f_X: \mathbb R^p \to [0,\infty)$ defined by
\begin{align*}
f_X(x) = (2\pi)^{-p/2}(\det \Sigma)^{-1/2}\exp\left(-\frac{1}{2}(x-\mu)^\top\Sigma^{-1}(x-\mu)\right), \qquad x \in \mathbb R^p.
\end{align*}
[/definition]
The exponent is the squared distance from $x$ to $\mu$ measured in the metric determined by $\Sigma^{-1}$. Directions with large variance are penalised less strongly, so density contours are ellipsoids aligned with the eigenspaces of $\Sigma$.
[example: Bivariate Normal Contours]
Let $X=(X_1,X_2)^\top \sim \mathcal N_2(0,\Sigma)$ with
\begin{align*}
\Sigma = \begin{pmatrix}1 & \rho\\ \rho & 1\end{pmatrix}, \qquad -1<\rho<1.
\end{align*}
The contours of the density are the ellipses $x^\top\Sigma^{-1}x=c$. When $\rho>0$, the long axis is parallel to the line $x_1=x_2$; when $\rho<0$, it is parallel to $x_1=-x_2$. Diagonalising $\Sigma$ gives eigenvalues $1+\rho$ and $1-\rho$, so the axis lengths are proportional to $\sqrt{1+\rho}$ and $\sqrt{1-\rho}$.
[/example]
The density is not the most general description because some Gaussian laws live on lower-dimensional affine subspaces.
To include both full-dimensional and singular cases without changing notation, we need a definition that does not require a determinant, inverse, or Lebesgue density. The characteristic function provides that unified description.
[definition: Multivariate Normal Distribution]
Let $p \in \mathbb N$, let $\mu \in \mathbb R^p$, and let $\Sigma \in \mathbb R^{p \times p}$ be positive semidefinite. A random vector $X: \Omega \to \mathbb R^p$ has the multivariate normal distribution with mean $\mu$ and covariance matrix $\Sigma$, written $X \sim \mathcal N_p(\mu,\Sigma)$, if its characteristic function is
\begin{align*}
\phi_X(t) = \mathbb E[e^{it^\top X}] = \exp\left(it^\top\mu - \frac{1}{2}t^\top\Sigma t\right), \qquad t \in \mathbb R^p.
\end{align*}
[/definition]
This definition includes the non-singular density case and also the singular case. If $\operatorname{rank}(\Sigma)=r<p$, the distribution is concentrated on the affine subspace $\mu + \operatorname{Range}(\Sigma)$ and has no density with respect to $\mathcal L^p$.
We still need to verify that the two descriptions agree when both are available. In the positive definite case, the density formula should produce exactly the characteristic function used in the general definition.
[quotetheorem:4003]
[citeproof:4003]
Positive definiteness is needed here because the proof uses an ordinary density on $\mathbb R^p$, the inverse matrix $\Sigma^{-1}$, and the determinant factor $(\det\Sigma)^{-1/2}$. If $\Sigma$ is singular, for instance $X=(Z,Z)^\top$ with $Z\sim\mathcal N(0,1)$, the law is supported on the line $x_1=x_2$ and no density with respect to $\mathcal L^2$ exists. The theorem identifies the density and characteristic-function definitions in the non-singular case, but it does not by itself construct singular Gaussian laws; the characteristic-function definition is needed for that. This is why the characteristic function viewpoint is used throughout the chapter: it turns transformations and independence statements into matrix calculations without repeatedly separating singular and non-singular cases.
## Affine Transformations
If data are centred, projected, rescaled, or combined into contrasts, recomputing the resulting density by integration is quickly impractical, and it can even fail when the transformation collapses dimension. A projection from $\mathbb R^p$ to a lower-dimensional subspace produces a singular covariance matrix when it is viewed in a larger ambient space. The key structural fact is that affine transformations preserve normality and that characteristic functions record the transformed mean and covariance directly.
The same characteristic-function calculation also explains two related facts that will be used later rather than proved from densities each time. First, if a random perturbation is asymptotically negligible, then adding it to an asymptotically normal vector preserves the same limiting normal law; this is the multivariate Slutsky-type stability principle behind many large-sample approximations. Second, for a Gaussian vector, a diagonal covariance matrix is not merely uncorrelatedness: it forces coordinate independence. These extensions belong here because they are all consequences of how Gaussian characteristic functions encode linear transformations, limits, and covariance structure.
[quotetheorem:1853]
[citeproof:1853]
The matrix dimensions and linearity hypotheses are essential: $A\Sigma A^\top$ is exactly the covariance of $AX$, and no comparable formula holds for a general nonlinear transformation such as $X\mapsto X^2$. The theorem also shows that singular covariance matrices are unavoidable: if $A$ has rank smaller than $q$, then $A\Sigma A^\top$ may be singular even when $\Sigma$ is positive definite. What the theorem does not say is that every transformation of a Gaussian vector is Gaussian; it applies only to affine maps. Its main use in the next section is to make coordinate projections, contrasts, and marginal distributions immediate consequences of one calculation.
[example: Linear Contrast]
Let $X=(X_1,X_2,X_3)^\top \sim \mathcal N_3(\mu,\Sigma)$ and consider the contrast $Y=X_1-X_2$. With $A=(1,-1,0)$, the affine transformation theorem gives
\begin{align*}
Y \sim \mathcal N_1(\mu_1-\mu_2,\; \Sigma_{11}+\Sigma_{22}-2\Sigma_{12}).
\end{align*}
This computes the distribution of a difference of two correlated measurements without integrating out the third coordinate.
[/example]
## Marginal And Conditional Distributions
Statistical prediction asks for the distribution of unobserved coordinates after observed coordinates have been fixed. For a multivariate normal vector, both marginalisation and conditioning preserve normality, and the conditional mean is an affine function of the observed value.
Write a partitioned vector and covariance matrix as
\begin{align*}
X = \begin{pmatrix}X_1\\X_2\end{pmatrix}, \qquad
\mu = \begin{pmatrix}\mu_1\\\mu_2\end{pmatrix}, \qquad
\Sigma = \begin{pmatrix}
\Sigma_{11} & \Sigma_{12}\\
\Sigma_{21} & \Sigma_{22}
\end{pmatrix},
\end{align*}
where $X_1\in\mathbb R^{p_1}$, $X_2\in\mathbb R^{p_2}$, and $p_1+p_2=p$. Assume in the conditional theorem that $\Sigma_{22}$ is positive definite.
Coordinate projection is a special affine transformation: selecting $X_1$ from $(X_1,X_2)$ is multiplication by a block projection matrix. Therefore the affine-stability theorem from the previous section gives $X_1\sim\mathcal N_{p_1}(\mu_1,\Sigma_{11})$ and $X_2\sim\mathcal N_{p_2}(\mu_2,\Sigma_{22})$. This projection argument does not give conditional laws, since discarding $X_2$ is different from observing a value of $X_2$; its role is to isolate the covariance blocks that survive marginalisation, which are precisely the blocks that enter the conditional formula next.
Marginals discard coordinates. Conditional distributions keep the observed coordinate and update the distribution of the remaining coordinate through the cross-covariance matrix. The obstruction is that the original block $\Sigma_{11}$ still includes variation in $X_1$ that can be linearly predicted from $X_2$. The conditional covariance should remove that predictable part, leaving only the residual spread after the observed block has been accounted for.
[definition: Schur Complement Conditional Covariance]
For a block covariance matrix with $\Sigma_{22}$ positive definite, define
\begin{align*}
\Sigma_{1\mid 2} := \Sigma_{11}-\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}.
\end{align*}
[/definition]
The matrix $\Sigma_{1\mid 2}$ is the Schur complement of $\Sigma_{22}$ in $\Sigma$. It is the covariance remaining in $X_1$ after the best linear prediction from $X_2$ has been removed.
[quotetheorem:4004]
[citeproof:4004]
Positive definiteness ensures that the joint density exists and that the Schur complement is positive definite; the assumption on $\Sigma_{22}$ ensures that the observed block has an invertible covariance matrix. If $\Sigma_{22}$ is singular, conditioning on $X_2=x_2$ may require restricting to the support of $X_2$ and replacing inverses by a more delicate singular Gaussian conditioning formula. The theorem is also special to joint normal vectors: outside the Gaussian family, conditional means need not be affine and conditional variances need not be constant. The regression example below shows how the matrix formula becomes the familiar straight-line conditional mean in the bivariate case.
[example: Bivariate Normal Regression]
Let $(X,Y)^\top$ be bivariate normal with means $\mu_X,\mu_Y$, variances $\sigma_X^2,\sigma_Y^2$, and correlation $\rho$. For $\sigma_X^2>0$, the conditional law of $Y$ given $X=x$ is
\begin{align*}
Y\mid X=x \sim \mathcal N\left(\mu_Y+\rho\frac{\sigma_Y}{\sigma_X}(x-\mu_X),\; \sigma_Y^2(1-\rho^2)\right).
\end{align*}
Thus the regression function $\mathbb E[Y\mid X=x]$ is a straight line, and the conditional variance does not depend on $x$. The slope is positive, negative, or zero according to the sign of the covariance.
[/example]
## Independence And Zero Covariance
In general probability theory, uncorrelated random variables need not be independent. The multivariate normal family is exceptional: for jointly normal components, covariance blocks contain all information about independence.
[quotetheorem:4005]
[citeproof:4005]
The joint normal hypothesis is essential: zero covariance alone does not imply independence for arbitrary random variables, as the next example demonstrates. The block condition is also stronger than a statement about a single pair of coordinates when $X_1$ and $X_2$ are vectors; every cross-covariance entry must vanish. The theorem does not say that pairwise independence among many variables always gives mutual independence outside the jointly normal setting. Its forward role is to justify treating uncorrelated Gaussian contrasts as independent pieces in quadratic-form and sampling-distribution calculations.
[example: Uncorrelated But Not Independent Outside The Normal Family]
Let $Z\sim\mathcal N(0,1)$ and let $Y=Z^2$. Then $\operatorname{Cov}(Z,Y)=\mathbb E[Z^3]-\mathbb E[Z]\mathbb E[Z^2]=0$, but $Z$ and $Y$ are not independent because $Y$ is determined by $Z$. This example shows that zero covariance implies independence only under a joint normal assumption.
[/example]
For the conditional distribution, the same block condition has a second interpretation. If $\Sigma_{12}=0$, then $\mu_{1\mid 2}=\mu_1$ and $\Sigma_{1\mid 2}=\Sigma_{11}$, so observing $X_2$ does not change the law of $X_1$.
## Moment Generating Functions And Quadratic Forms
How can the mean vector, covariance matrix, and later test statistics be recovered without returning to multidimensional integrals? Direct integration against the Gaussian density becomes cumbersome once products of several coordinates or quadratic forms appear. Moment generating functions compress moment calculations into differentiation, while quadratic forms connect the geometry of ellipsoids to chi-squared distributions used later in inference.
[quotetheorem:4006]
[citeproof:4006]
The Gaussian assumption is what makes the moment generating function finite for every $t\in\mathbb R^p$ and quadratic in $t$; many distributions either have no moment generating function near all of $\mathbb R^p$ or have higher-order terms. Positive semidefiniteness is enough because a singular Gaussian can be represented as an affine image of a lower-dimensional standard normal vector. The formula identifies moments and cumulants, but it does not characterise arbitrary laws unless the moment generating function exists on a neighbourhood of $0$. Taking logarithms gives the cumulant generating function
\begin{align*}
K_X(t)=\log M_X(t)=t^\top\mu+\frac{1}{2}t^\top\Sigma t.
\end{align*}
Thus all cumulants of order at least three vanish. The first derivative at $0$ gives $\mu$, and the Hessian at $0$ gives $\Sigma$.
[example: Recovering A Covariance From The MGF]
For $X\sim\mathcal N_p(\mu,\Sigma)$, differentiating $M_X(t)$ gives
\begin{align*}
\mathbb E[X_iX_j] = \frac{\partial^2 M_X}{\partial t_i\partial t_j}(0)=\Sigma_{ij}+\mu_i\mu_j.
\end{align*}
Hence $\operatorname{Cov}(X_i,X_j)=\Sigma_{ij}$. This verifies that the parameter called $\Sigma$ in the distribution is the covariance matrix.
[/example]
The normal density describes ellipsoids geometrically, but inference needs a probability scale for those ellipsoids. The relevant scalar is the squared Mahalanobis length of a centred normal vector, because it combines all coordinates after whitening them to unit covariance. Identifying its distribution turns geometric distance from the mean into chi-squared critical values.
[quotetheorem:4007]
[citeproof:4007]
Positive definiteness is needed so that $\Sigma^{-1}$ and $\Sigma^{-1/2}$ exist; if $\Sigma$ is singular, the quadratic form must be restricted to the range of $\Sigma$ or written using a generalized inverse, and the degrees of freedom become the rank rather than $p$. Centering is also essential for the central chi-squared conclusion: if $X\sim\mathcal N_p(\mu,\Sigma)$ with $\mu\ne 0$, the corresponding quadratic form has a noncentral chi-squared distribution after whitening. The theorem therefore calibrates centred ellipsoids in the non-singular case, but it does not cover every Gaussian quadratic form. This is the algebra behind confidence ellipsoids for a mean vector and appears again in the Wishart and Hotelling distributions.
[example: Probability Inside A Normal Ellipsoid]
Let $X\sim\mathcal N_p(\mu,\Sigma)$ with $\Sigma$ positive definite. The ellipsoid
\begin{align*}
E_c=\{x\in\mathbb R^p:(x-\mu)^\top\Sigma^{-1}(x-\mu)\le c\}
\end{align*}
has probability $\mathbb P(X\in E_c)=\mathbb P(\chi^2_p\le c)$. Choosing $c$ as the $0.95$ quantile of $\chi^2_p$ therefore gives a $95\%$ probability ellipsoid under the model.
[/example]
The chapter has built the multivariate normal distribution as a matrix-stable probability model. The next part of the course uses independent normal samples to derive the Wishart distribution, which describes the random sample covariance matrix and supplies the sampling theory for multivariate inference.
With the distribution $\mathcal N_p(\mu,\Sigma)$ now in place, the natural next question is how its parameters are reflected in actual data. Independent normal samples let us move from population-level properties to the sampling behavior of estimators, especially the mean vector and covariance matrix.
# 3. Random Samples from the Multivariate Normal
This chapter asks what information about $(\mu, \Sigma)$ is contained in an independent sample from a multivariate normal population. Chapter 2 established the algebra of the distribution $\mathcal N_p(\mu,\Sigma)$: affine transformations, marginal and conditional laws, independence through covariance, and quadratic forms. We now turn those facts into sampling theory: the sample mean estimates location, the sample covariance estimates scatter, and the normal model gives an unusually strong separation between the two.
The central theme is that Gaussian samples split orthogonally into a mean component and a residual component. This decomposition is the multivariate version of the normal sampling theorem from one-variable statistics. It leads to independence of $\bar X$ and $S$, maximum likelihood estimators for $(\mu,\Sigma)$, and sufficiency of the pair $(\bar X,S)$.
## Estimating Location and Scatter from a Random Sample
Given random vectors $X_1,\dots,X_n$, the first question is how to summarize their central location and spread without discarding the geometry of the coordinates. In one dimension the answers are the sample mean and sample variance; in $p$ dimensions the mean becomes a vector and the variance becomes a covariance matrix.
[definition: Multivariate Random Sample]
Let $X_1,\dots,X_n$ be random vectors with values in $\mathbb R^p$. They form a random sample from $\mathcal N_p(\mu,\Sigma)$ if $X_1,\dots,X_n$ are independent and each $X_i \sim \mathcal N_p(\mu,\Sigma)$.
[/definition]
We often arrange the sample as an $n \times p$ data matrix $X$, whose $i$th row is $X_i^\top$. The row viewpoint is convenient for likelihoods, while the vector viewpoint is better for covariance calculations. A statistic should be understood as a function of the whole observed sample: before seeing the data it is a random variable, and after seeing the data it is a deterministic summary. This distinction matters because later sufficiency and likelihood arguments ask exactly which functions of the data retain information about $(\mu,\Sigma)$.
[definition: Sample Mean Vector]
For $n\ge 1$, the sample mean vector is the statistic
\begin{align*}
\bar X:(\mathbb R^p)^n &\to \mathbb R^p, &
(x_1,\dots,x_n) &\mapsto \frac{1}{n}\sum_{i=1}^n x_i.
\end{align*}
For a random sample $X_1,\dots,X_n$, this gives
\begin{align*}
\bar X = \frac{1}{n}\sum_{i=1}^n X_i.
\end{align*}
[/definition]
The vector $\bar X$ is the coordinatewise average, but it is more than a collection of $p$ separate means: under affine changes of coordinates, it transforms as a vector. Location alone cannot describe a multivariate cloud, because two samples can have the same mean but very different elongation and orientation. The next statistic therefore records the second-order geometry of the residuals after the common location has been removed.
[definition: Sample Covariance Matrix]
For $n\ge 2$, the sample covariance matrix is the statistic
\begin{align*}
S:(\mathbb R^p)^n &\to \mathbb S^p, &
(x_1,\dots,x_n) &\mapsto \frac{1}{n-1}\sum_{i=1}^n (x_i-\bar x)(x_i-\bar x)^\top,
\end{align*}
where $\mathbb S^p$ denotes the set of real symmetric $p\times p$ matrices and $\bar x=n^{-1}\sum_{i=1}^n x_i$. For a random sample $X_1,\dots,X_n$, this gives
\begin{align*}
S = \frac{1}{n-1}\sum_{i=1}^n (X_i-\bar X)(X_i-\bar X)^\top.
\end{align*}
[/definition]
The denominator $n-1$ reflects the loss of one vector-valued degree of freedom caused by centering at $\bar X$. It is often useful to separate this unbiased version from the maximum likelihood version
\begin{align*}
S_n = \frac{1}{n}\sum_{i=1}^n (X_i-\bar X)(X_i-\bar X)^\top = \frac{n-1}{n}S.
\end{align*}
This raises a basic calibration question: which normalisation targets the population covariance on average under repeated sampling? Before studying the full distribution of $S$, we need to know that its entries are centred at the corresponding entries of $\Sigma$. The next result supplies that first-order benchmark without requiring normality.
[quotetheorem:4008]
[citeproof:4008]
This result explains why two covariance normalisations coexist. The matrix $S$ is unbiased for $\Sigma$, while $S_n$ is the likelihood maximiser under the Gaussian model. Normality is not needed here; the conclusion can fail only if independence or finite second moments fail, for instance when cross terms no longer have expectation zero or the entries of $X_iX_i^\top$ are not integrable. The theorem also does not say that $S$ is close to $\Sigma$ for a particular small sample, only that its average over repeated samples equals $\Sigma$. The stronger Gaussian structure enters in the next section, using the Chapter 2 fact that uncorrelated jointly normal blocks are independent.
[example: Bivariate Mean and Covariance]
Suppose $p=2$ and the observations are $X_i=(Y_i,Z_i)^\top$. Then
\begin{align*}
S = \begin{pmatrix}
\frac{1}{n-1}\sum_i (Y_i-\bar Y)^2 & \frac{1}{n-1}\sum_i (Y_i-\bar Y)(Z_i-\bar Z) \\
\frac{1}{n-1}\sum_i (Y_i-\bar Y)(Z_i-\bar Z) & \frac{1}{n-1}\sum_i (Z_i-\bar Z)^2
\end{pmatrix}.
\end{align*}
This example shows that the diagonal entries are the usual sample variances and the off-diagonal entries record paired movement of the two coordinates. A direct expansion verifies that the expectation of each entry agrees with the corresponding entry of $\Sigma$.
[/example]
## Orthogonal Decomposition of a Normal Sample
Why does the normal model give more than unbiasedness? The answer is that Gaussian vectors preserve independence under orthogonal projections: uncorrelated linear projections of a jointly normal vector are independent. The sample mean is one projection of the data, and the centered residuals are the complementary projection. Without normality this implication breaks: uncorrelated summaries can still be dependent, so a sample mean can carry information about the sample spread. This is the obstruction the next theorem removes in the Gaussian case.
To express this cleanly, write the stacked sample as
\begin{align*}
Y = \begin{pmatrix} X_1 \\ \vdots \\ X_n \end{pmatrix} \in \mathbb R^{np}.
\end{align*}
The subspace of constant rows corresponds to the sample mean, while its orthogonal complement corresponds to residual variation.
[definition: Outer Product Notation]
For column vectors $a,b\in\mathbb R^p$, the notation
\begin{align*}
a\otimes b
\end{align*}
means the outer-product matrix $ab^\top$. Thus $(X_i-\bar X)\otimes (X_i-\bar X)$ is the same $p\times p$ covariance contribution as $(X_i-\bar X)(X_i-\bar X)^\top$. Later, when both inputs are matrices, the same symbol denotes the Kronecker product; the dimensions make clear which operation is intended.
[/definition]
With this notation in place, the theorem can state the Gaussian sample decomposition as a clean separation between the mean direction and the residual outer-product information.
[illustration:sample-orthogonal-decomposition]
The geometric decomposition still needs a distributional theorem: orthogonal residual directions must be independent of the mean direction under the Gaussian model. The result below packages that fact together with the Wishart law for the residual outer products.
[quotetheorem:4009]
[citeproof:4009]
This theorem is the multivariate analogue of the normal sampling theorem. It is special to the normal model: for general distributions, the sample mean and sample covariance usually interact. For example, if a one-dimensional population is symmetric but has heavier tails than a normal distribution, an unusually large sample mean often occurs together with unusually large residual variation because both are affected by extreme observations. The theorem does not assert that $\bar X$ is independent of every statistic built from the sample, only of the residual covariance information. This separation is the reason later tests can combine a normal mean component with an independent covariance estimate.
[example: Why Orthogonality Matters]
For $n=2$, take
\begin{align*}
Y_1 = \frac{X_1+X_2}{\sqrt 2}, \qquad Y_2 = \frac{X_1-X_2}{\sqrt 2}.
\end{align*}
Then $Y_1$ determines $\bar X$ and $Y_2Y_2^\top$ determines $S$. Under the normal model, $Y_1$ and $Y_2$ are independent because their covariance is zero after the orthogonal transformation. This two-observation case displays the geometric reason for the general theorem.
[/example]
## Distribution of the Sample Mean
Once the mean component has been separated, its distribution follows from affine stability of the multivariate normal. The question is not merely whether $\bar X$ is centered at $\mu$, but how its uncertainty scales with $n$ and with the covariance geometry of the population.
[quotetheorem:4010]
[citeproof:4010]
This theorem is the finite-sample version of the multivariate central limit theorem in the Gaussian case. No limiting approximation is needed: the distribution is exact for every $n$. Independence is essential for the covariance scaling; if the observations are positively correlated, the variance of the average contains cross-covariance terms and need not be $\Sigma/n$. The theorem also does not estimate $\Sigma$ by itself, so confidence regions based on it are directly usable only when $\Sigma$ is known or replaced by an independent covariance estimate. That replacement is precisely why the preceding independence theorem matters.
[example: Linear Contrasts of the Sample Mean]
Let $a\in\mathbb R^p$ be nonzero. The scalar contrast $a^\top\bar X$ satisfies
\begin{align*}
\sqrt n\,a^\top(\bar X-\mu) \sim \mathcal N(0,a^\top\Sigma a).
\end{align*}
Thus confidence statements for any fixed linear combination of coordinates reduce to the one-dimensional normal theory. The direction $a$ determines which axis of the covariance ellipsoid is being measured.
[/example]
## Maximum Likelihood Estimation of the Normal Parameters
The next question is how the Gaussian likelihood chooses estimators for location and scatter. Unbiasedness is a moment property; maximum likelihood instead asks which parameter values make the observed data most probable under the model. Unbiasedness alone does not choose between competing estimators, because many different statistics may have the same expectation while behaving differently under transformation, constraints, or large-sample approximation. In the normal model the likelihood also encodes the geometry of ellipsoidal density contours, so it supplies a separate criterion for both location and scatter.
Assume for this section that $\Sigma$ is positive definite. Up to an additive constant independent of $(\mu,\Sigma)$, the log-likelihood for observations $x_1,\dots,x_n\in\mathbb R^p$ is
\begin{align*}
\ell(\mu,\Sigma)
= -\frac{n}{2}\log\det\Sigma
-\frac{1}{2}\sum_{i=1}^n (x_i-\mu)^\top\Sigma^{-1}(x_i-\mu).
\end{align*}
[quotetheorem:4011]
[citeproof:4011]
The likelihood estimator of covariance uses denominator $n$, not $n-1$. Hence it is biased by the factor $(n-1)/n$, but it is the natural estimator for likelihood-based asymptotic theory. The full-rank condition is not cosmetic: if the centred observations span a proper subspace of $\mathbb R^p$, the likelihood can be increased by making $\det\Sigma$ small in directions with no residual variation, so no positive definite maximizer exists. In particular, one needs enough nondegenerate data, and typically $n>p$, before the unrestricted covariance MLE is well behaved. This finite-sample rank issue is one reason high-dimensional covariance estimation needs methods beyond the classical MLE.
Once the finite-sample maximizer exists, the next question is whether likelihood has selected the right population quantities rather than merely a convenient stationary point. The bias in the covariance estimator and the rank restriction both make this worth checking: a useful likelihood estimator must recover $\mu$ and $\Sigma$ as the sample size grows, despite using the likelihood normalization and despite being defined through a random scatter matrix.
[quotetheorem:4012]
[citeproof:4012]
Consistency says that the estimators converge to the right targets, but it does not say how large the remaining error is for a large finite sample. For inference, we need the scale and shape of the fluctuations around the limit. Asymptotic normality supplies this first-order approximation, and the covariance estimator has a matrix-valued limiting distribution most conveniently expressed after vectorising its entries.
[quotetheorem:4013]
[citeproof:4013]
These asymptotic statements justify the MLEs for large samples, while the previous independence theorem supplies stronger finite-sample structure in the Gaussian case. Positive definiteness of $\Sigma$ ensures a regular nondegenerate limit; if $\Sigma$ is singular, some linear combinations have zero variance and the usual vectorised normal approximation must be restricted to the effective subspace. The covariance formula also uses Gaussian fourth moments, so the same display is not valid unchanged for arbitrary distributions with the same covariance matrix. This distinction between model-specific exact structure and broader large-sample approximation motivates asking which parts of the raw sample are actually informative under the normal model.
## Sufficiency of the Mean and Covariance
The final question in this chapter is whether the whole data set contains information about $(\mu,\Sigma)$ beyond $\bar X$ and $S$. Sufficiency answers this by asking whether the likelihood can be factored into a parameter-dependent part involving the statistic and a parameter-free part involving the raw data. A statistic can lose information: for instance, keeping only $\bar X$ cannot distinguish a tightly clustered sample from a widely scattered one, so it cannot be sufficient for an unknown covariance matrix. The point of the next definition is to make this idea precise without relying on intuition about particular summaries.
[definition: Sufficient Statistic]
Let $(P_\theta)_{\theta\in\Theta}$ be a statistical model on a sample space $\mathcal X$ with density or mass function $f_\theta(x)$. A statistic $T:\mathcal X\to\mathcal T$ is sufficient for $\theta$ if the likelihood factors as
\begin{align*}
f_\theta(x)=g_\theta(T(x))h(x)
\end{align*}
for some nonnegative functions $g_\theta:\mathcal T\to[0,\infty)$ and $h:\mathcal X\to[0,\infty)$.
[/definition]
The factorisation criterion makes sufficiency practical: it is enough to inspect how the parameter enters the joint density. It also explains why sufficiency is model-dependent: the same statistic may be sufficient in one family and insufficient in another if the likelihood uses the raw data in a different way. In the multivariate normal family, the exponent is quadratic in the observations, so the natural candidates are exactly the first and second empirical moments.
The remaining issue is whether those candidates actually capture every occurrence of the unknown mean and covariance in the joint density. If any parameter-dependent term still involved the individual observations beyond $\bar X$ and $S$, then these summaries would be convenient but not sufficient. The theorem resolves that likelihood factorisation question for the normal sample.
[quotetheorem:4014]
[citeproof:4014]
Sufficiency formalises why the chapter focuses on $\bar X$ and $S$. Under the normal model, these two summaries retain all information about the unknown location and scatter parameters. The positive definiteness assumption keeps the density formula valid with respect to Lebesgue measure; singular normal laws require a different support and a modified factorisation argument. The theorem does not say that $(\bar X,S)$ is low-dimensional when $p$ is large, nor that either component is stable in small samples. Those limitations lead naturally to the distribution of $S$ and to dimension-reduction methods that study the eigenstructure of sample covariance matrices.
[example: Eigenvalues Of The Sample Covariance]
Consider repeated simulations with $X_1,\dots,X_n\sim\mathcal N_p(0,I_p)$ and compute the eigenvalues $\lambda_1\ge\dots\ge\lambda_p$ of $S$ each time. When $n$ is not much larger than $p$, the largest eigenvalues tend to be noticeably above $1$ and the smallest eigenvalues tend to be noticeably below $1$; if $n\le p$, then $S$ is singular because the centred data have rank at most $n-1$. The simulation illustrates that $S$ is unbiased entrywise but its eigenvalues can be highly variable in high-dimensional finite samples. This phenomenon motivates later topics such as principal component analysis and dimension reduction.
[/example]
The chapter has established the sampling foundation for the rest of the course. The mean and covariance are unbiased in the standard normal-sampling sense, independent under Gaussian assumptions, likelihood-based estimators of $(\mu,\Sigma)$, and sufficient summaries of the data. The next stage is to describe the distribution of the covariance component itself, which leads to the Wishart distribution and then to Hotelling $T^2$ statistic.
Once the sample mean and covariance have been identified as the key summaries, the remaining task is to understand the random matrix formed by the covariance contribution itself. That leads directly to the Wishart distribution, which is the sampling law underlying multivariate covariance inference and many later test statistics.
# 4. The Wishart Distribution
Chapter 3 turned the multivariate normal distribution into sampling theory for $\bar X$ and $S$ by reducing the sample into mean and residual components. This chapter studies the random matrices obtained by adding outer products of independent normal vectors. These matrices are the sampling distributions of covariance matrices, and they will become the algebraic engine behind Hotelling T^2, MANOVA, and likelihood ratio tests for covariance structure.
## From Normal Vectors to Random Covariance Matrices
The basic statistical problem is to understand the fluctuation of a sample covariance matrix. In one dimension, sums of squared centred normal observations give chi-squared random variables. In several dimensions, the same construction gives a random positive semidefinite matrix rather than a scalar.
[definition: Wishart Distribution]
Let $X_1,\dots,X_n$ be independent random vectors with $X_i \sim \mathcal N_p(0,\Sigma)$, where $\Sigma \in \mathbb R^{p\times p}$ is symmetric positive semidefinite. Define
\begin{align*}
W = \sum_{i=1}^n X_iX_i^\top.
\end{align*}
The distribution of $W$ is called the $p$-dimensional Wishart distribution with $n$ degrees of freedom and scale matrix $\Sigma$, written
\begin{align*}
W \sim W_p(n,\Sigma).
\end{align*}
[/definition]
The matrix $W$ records all sums of products among the coordinates of the sample. Its diagonal entries are sums of squares, while its off-diagonal entries are sums of cross-products. When $\Sigma$ is positive definite and $n\ge p$, the matrix $W$ is positive definite with probability $1$; when $n<p$, it has rank at most $n$ and is singular.
[example: One-Dimensional Wishart]
Take $p=1$ and write $\Sigma=(\sigma^2)$. If $X_i\sim \mathcal N(0,\sigma^2)$ are independent, then
\begin{align*}
W=\sum_{i=1}^n X_i^2.
\end{align*}
Writing $X_i=\sigma Z_i$ with $Z_i\sim \mathcal N(0,1)$ gives $W/\sigma^2=\sum_{i=1}^n Z_i^2\sim \chi_n^2$. Thus $W_1(n,\sigma^2)$ is the scaled chi-squared distribution $\sigma^2\chi_n^2$, so the Wishart distribution is the matrix-valued analogue of chi-squared sampling theory.
[/example]
The connection with sample covariance is immediate once the sample mean is removed. This is the main bridge between the definition and multivariate statistical inference. The point that needs care is that centring the data changes the number of independent normal directions.
[example: Why One Degree of Freedom Is Lost]
For scalar observations $Y_1,\dots,Y_n\sim \mathcal N(\mu,\sigma^2)$, the centred variables $Y_i-\bar Y$ satisfy $\sum_{i=1}^n (Y_i-\bar Y)=0$. They therefore cannot be $n$ independent normal variables. The sum of squares has the law $\sigma^2\chi^2_{n-1}$ rather than $\sigma^2\chi^2_n$, and the matrix theorem below is the multivariate version of this loss of one independent contrast.
[/example]
This example explains why the theorem is not obtained by simply replacing $\mu$ with $\bar Y$ in the definition of $W_p(n,\Sigma)$.
For multivariate samples, the same centring operation removes one vector-valued direction from the data matrix. The remaining residual contrasts still carry covariance information, but only through $n-1$ independent Gaussian directions. The theorem identifies the exact distribution of the centred cross-product matrix, which is the distribution needed whenever the population mean is unknown and covariance is estimated from residuals.
[quotetheorem:4015]
[citeproof:4015]
This theorem explains why the degrees of freedom are $n-1$ rather than $n$: estimating the mean removes one independent normal contrast. Positive definiteness of $\Sigma$ keeps the population covariance non-degenerate; if $\Sigma$ is singular, the same construction is confined to a lower-dimensional subspace and the full $p$-dimensional covariance matrix is singular. Normality is also doing real work: for non-normal samples, the sample covariance may still be unbiased, but its exact finite-sample distribution is not Wishart in general.
Independence of the observations is equally essential. If all $Y_i$ are equal to the same normal vector, then the centred sample covariance is the zero matrix, not a Wishart matrix with $n-1$ degrees of freedom. Thus the theorem is an exact sampling distribution result for independent Gaussian samples, not a general covariance identity. It is the result that later permits Hotelling $T^2$ to combine an independent normal sample mean with an independent Wishart estimate of covariance.
## The Density and the Multivariate Gamma Function
The next question is whether the Wishart distribution has a density, and what the correct normalising constant should be. Since $W$ lives in the cone of symmetric positive definite matrices when $\Sigma$ is positive definite and $n\ge p$, the density must be written with respect to Lebesgue measure on the independent entries of a symmetric matrix.
[definition: Multivariate Gamma Function]
For $p\in\mathbb N$, the multivariate gamma function is the map
\begin{align*}
\Gamma_p:((p-1)/2,\infty)&\to(0,\infty),\\
a&\mapsto \pi^{p(p-1)/4}\prod_{j=1}^p \Gamma\left(a+\frac{1-j}{2}\right).
\end{align*}
[/definition]
The factor $\Gamma_p(a)$ is the matrix analogue of the ordinary gamma function. It appears when integrating powers of determinants against trace-exponential terms over the positive definite cone.
With the normalising constant identified, the next question is the exact density of the Wishart law on that cone. The formula must account simultaneously for the scale matrix, the determinant power near the boundary, and the trace term inherited from Gaussian quadratic forms. Stating the density makes precise when the outer-product construction produces a full-dimensional distribution on positive definite matrices.
[quotetheorem:4016]
[citeproof:4016]
The condition $n\ge p$ is not just a technical assumption for the formula. For integer degrees of freedom it is exactly the range where the outer-product representation can have full rank and the determinant power is locally integrable near the boundary of the positive definite cone. If $n<p$, the outer-product representation has rank at most $n$, so the distribution is supported on singular matrices and cannot have a density with respect to Lebesgue measure on all independent entries.
The positive definiteness of $\Sigma$ is also necessary for this density statement. A singular scale matrix confines every $X_i$ to a proper linear subspace, so the resulting $W$ again lives on a lower-dimensional set. The theorem therefore describes the non-singular Wishart law; singular Wishart distributions exist, but they require a different reference measure and are not described by the displayed density. This distinction matters later when likelihood ratio tests assume invertible sample covariance matrices.
[example: Density When $p=1$]
For $p=1$, the multivariate gamma function satisfies $\Gamma_1(a)=\Gamma(a)$. The density becomes
\begin{align*}
f(w)=\frac{w^{n/2-1}e^{-w/(2\sigma^2)}}{2^{n/2}(\sigma^2)^{n/2}\Gamma(n/2)},
\qquad w>0,
\end{align*}
which is the density of $\sigma^2\chi_n^2$. This recovers the usual chi-squared density from the general matrix formula.
[/example]
## Sums, Blocks, and Marginals
A sampling distribution is useful only if it behaves well under aggregation and projection. The Wishart distribution has two stability properties: independent samples add their degrees of freedom, and principal submatrices remain Wishart. Both properties have hypotheses that cannot be suppressed.
[example: Why Common Scale Is Needed]
In one dimension, let $W_1\sim \sigma_1^2\chi^2_{n_1}$ and $W_2\sim \sigma_2^2\chi^2_{n_2}$ be independent with $\sigma_1^2\ne\sigma_2^2$. The sum $W_1+W_2$ is a sum of scaled chi-squared variables with different scales, so it is not generally a single scaled chi-squared variable. Thus even in dimension one, additivity preserves the Wishart family only when the scale matrix is common.
[/example]
This obstruction is the reason the theorem below assumes the same $\Sigma$ for both matrices.
When two independent covariance cross-products come from the same population covariance, pooling them should preserve the common covariance geometry while increasing the amount of information. The question is whether this intuition is exact at the distributional level, not merely at the level of expectations. The additivity theorem answers that question and supplies the algebra behind combining independent sums of squares.
[quotetheorem:4017]
[citeproof:4017]
This result is the matrix analogue of additivity of independent chi-squared variables. The common scale matrix is essential because it means both samples estimate the same covariance structure; different scale matrices would mix two covariance geometries and generally leave the Wishart family. Independence is also essential: if $W_2=W_1$, then $W_1+W_2=2W_1$, which changes the scale rather than simply adding degrees of freedom.
The theorem does not say that arbitrary positive semidefinite random matrices add to a Wishart matrix. It says that independent Gaussian outer-product information pools cleanly when it comes from the same population covariance. This is the algebraic principle behind decomposing sums of squares in MANOVA into independent within-group and between-group components.
Aggregation is only one way to simplify a covariance cross-product; another is to ignore some coordinates and study a marginal block. Since a principal block corresponds to measuring only a selected subset of variables, one expects it to inherit a Wishart law with the matching submatrix of the scale parameter. The next result makes that marginal stability precise and separates principal blocks from other matrix subblocks.
[quotetheorem:4018]
[citeproof:4018]
The block theorem only concerns principal blocks, not arbitrary submatrices or Schur complements. Its hypotheses are inherited from the vector construction: selecting coordinates of a multivariate normal vector gives another multivariate normal vector with covariance equal to the corresponding principal block of $\Sigma$. A non-principal off-diagonal block such as $W_{12}$ is usually rectangular and is not itself a covariance cross-product matrix of one vector with itself, so it has no Wishart law in general.
This limitation is important in covariance testing. Marginal covariance estimates for a subset of variables remain Wishart, but cross-covariance blocks require separate distributional arguments.
Diagonal entries are the simplest principal blocks. They behave like scaled chi-squared variables, although the diagonal entries need not be mutually independent when the corresponding coordinates of the normal vector are correlated.
[quotetheorem:4019]
[citeproof:4019]
This theorem gives only one-dimensional marginal distributions. It does not assert that the diagonal entries of $W$ are jointly independent, nor does it determine the distribution of the full diagonal vector without using the covariance structure among the underlying squared normal coordinates. The condition $\sigma_{jj}>0$ excludes the degenerate case where the $j$th coordinate is identically zero and $w_{jj}=0$.
The result is still useful because many variance estimates reduce to diagonal entries of a Wishart matrix. The next example shows why marginal chi-squared laws should not be mistaken for independence.
[example: Two Correlated Coordinates]
Suppose $p=2$ and
\begin{align*}
\Sigma=\begin{pmatrix}1&\rho\\ \rho&1\end{pmatrix},
\qquad -1<\rho<1.
\end{align*}
Then $w_{11}\sim \chi_n^2$ and $w_{22}\sim \chi_n^2$ marginally. The entries are correlated when $\rho\ne0$, because both are computed from paired normal coordinates whose squares have non-zero covariance. This example separates marginal chi-squared behaviour from joint independence.
[/example]
## Bartlett Decomposition and Exact Simulation
The density is useful for likelihood theory, but simulation and many derivations require a constructive factorisation. Bartlett decomposition rewrites an identity-scale Wishart matrix as a triangular random matrix times its transpose. The construction depends on having enough observations to produce a full-rank triangular factor.
[example: What Fails When $n<p$]
If $X$ is an $n\times p$ standard normal data matrix with $n<p$, then $W=X^\top X$ has rank at most $n$. Therefore $W$ is singular with probability $1$ and cannot be written as $AA^\top$ with $A$ a nonsingular $p\times p$ triangular matrix whose diagonal entries are all positive. This is why the full Bartlett decomposition below assumes $n\ge p$.
[/example]
The theorem turns the geometric process of orthogonalising columns into independent scalar random variables.
For exact simulation, it is not enough to know the density; we need a recipe that produces a matrix with the correct joint dependence among entries. Orthogonalising the columns of a standard normal data matrix separates new length information from previously constructed directions. Bartlett decomposition packages this separation into a triangular factor whose diagonal and off-diagonal entries have simple independent distributions.
[quotetheorem:4020]
[citeproof:4020]
The assumption $n\ge p$ is what makes all diagonal chi-squared degrees of freedom positive and gives a nonsingular Cholesky factor. In the singular case $n<p$, a related rectangular factorisation remains possible, but the displayed triangular factor with $p$ positive diagonal entries does not exist. The identity scale assumption is also part of the theorem: it is what makes the orthogonal projections have standard normal coefficients.
For a general positive definite scale matrix, combine Bartlett's identity-scale factorisation with a Cholesky factor of $\Sigma$. If $\Sigma=LL^\top$ and $A$ is the Bartlett lower triangular factor for $W_p(n,I_p)$, then
\begin{align*}
LAA^\top L^\top \sim W_p(n,\Sigma).
\end{align*}
This gives an exact simulation algorithm: sample the independent entries of $A$, form $LA$, and return $(LA)(LA)^\top$. The triangular orientation is conventional rather than substantive; an upper triangular version is obtained by transposing the factor, but the degrees of freedom must stay matched to the order in which columns are orthogonalised.
[example: Simulating a $2\times2$ Wishart Matrix]
Let $n\ge2$ and $\Sigma=LL^\top$. For identity scale, Bartletts factor is
\begin{align*}
A=\begin{pmatrix}
\sqrt{U_1} & 0\\
Z & \sqrt{U_2}
\end{pmatrix},
\qquad
U_1\sim\chi_n^2,
\quad U_2\sim\chi_{n-1}^2,
\quad Z\sim\mathcal N(0,1),
\end{align*}
with $U_1,U_2,Z$ independent. The simulated covariance cross-product is $LAA^\top L^\top$. This shows how the degrees of freedom descend along the diagonal as each new column is orthogonalised against the previous ones.
[/example]
## First and Second Moments
For inference, we often need expectations and variances rather than the full density. These moment formulas are obtained directly from the outer-product representation and fourth moments of the multivariate normal distribution. The Gaussian fourth-moment identity is the part that makes the covariance formula take such a compact form.
[example: Matching Mean Does Not Determine Wishart Moments]
Let $X$ be a centred non-Gaussian random vector with covariance matrix $\Sigma$, and set $W=XX^\top$ for one observation. Then $\mathbb E[W]=\Sigma$ still holds when the second moments exist. However, $\operatorname{Cov}(W_{ij},W_{kl})$ depends on fourth moments of $X$, so it need not equal the Gaussian expression $\sigma_{ik}\sigma_{jl}+\sigma_{il}\sigma_{jk}$. Thus the moment theorem below is not a consequence of covariance alone.
[/example]
This example identifies the role of normality in the second-moment calculation.
The outer-product definition immediately suggests that the mean of a Wishart matrix should scale linearly with $n$, but variability of its entries is subtler. Each entry $W_{ij}$ mixes products of normal coordinates, so covariances between entries depend on fourth-order Gaussian structure. For applications, we need formulas that quantify this entrywise variability directly, since tests and approximations often use moments rather than the full density. The moment theorem records exactly which pairings survive and gives formulas that can be used without returning to the density.
[quotetheorem:4021]
[citeproof:4021]
The mean formula says that $n^{-1}W$ is an unbiased estimator of $\Sigma$ when the mean is known to be zero. In the unknown mean case, the theorem on sample covariance shows that $S$ is unbiased because $\mathbb E[(n-1)S]=(n-1)\Sigma$.
The covariance formula relies on Gaussian fourth moments through Isserlis formula. For non-Gaussian observations with the same covariance matrix, the expectation of $W$ may be unchanged, but the variances and covariances of its entries include fourth cumulant terms. The formula also does not imply independence of entries: zero covariance between two entries is weaker than independence, and many entries of a Wishart matrix are dependent even when their covariance happens to vanish.
These moments are the ingredients for assessing the precision of covariance estimators and for deriving large-sample approximations when exact Wishart calculations become unwieldy.
[example: Variance of a Sample Variance]
For $p=1$, write $W\sim W_1(n,\sigma^2)$. The variance formula gives
\begin{align*}
\operatorname{Var}(W)=n(\sigma^4+\sigma^4)=2n\sigma^4,
\end{align*}
matching the variance of $\sigma^2\chi_n^2$. For a sample covariance matrix with unknown mean, $(n-1)S\sim W_1(n-1,\sigma^2)$, so $\operatorname{Var}(S)=2\sigma^4/(n-1)$.
[/example]
## Spectral Viewpoint and High-Dimensional Preview
The Wishart distribution also controls the random eigenvalues of sample covariance matrices. Classical multivariate analysis usually treats $p$ fixed and $n$ large, while modern high-dimensional statistics often studies regimes where $p/n$ tends to a positive constant.
[explanation: Eigenvalues of Wishart Matrices]
If $W\sim W_p(n,\Sigma)$ and $\Sigma=I_p$, the eigenvalues of $n^{-1}W$ describe the principal variances observed in an isotropic normal sample. For fixed $p$, these eigenvalues concentrate near $1$ as $n\to\infty$. In a high-dimensional regime with $p/n\to\gamma\in(0,\infty)$, the empirical eigenvalue distribution has a non-degenerate limit, the Marchenko-Pastur law, in the identity-scale case. This preview explains why Wishart matrices appear both in classical covariance inference and in random matrix theory.
[/explanation]
The Marchenko-Pastur statement is quoted here only as a high-dimensional preview, not as a theorem proved in this chapter. The present chapter uses it to indicate the spectral behaviour that later random matrix methods make precise.
[example: Sample Eigenvalues Compared with Marchenko-Pastur]
Take $\Sigma=I_p$ and simulate $W\sim W_p(n,I_p)$ for large $p$ and $n$ with $p/n\approx\gamma$. The eigenvalues of $n^{-1}W$ typically fill an interval close to
\begin{align*}
\left((1-\sqrt{\gamma})^2,(1+\sqrt{\gamma})^2\right)
\end{align*}
when $0<\gamma<1$. This contrasts with the fixed-$p$ picture, where each sample eigenvalue converges to $1$ as $n$ grows.
[/example]
The rest of the course uses the Wishart distribution as a building block. Chapter 5 uses Hotelling's $T^2$ to combine a multivariate normal mean vector with an independent Wishart covariance estimator, while Chapter 10 uses MANOVA to compare Wishart-type within-group and between-group sums of squares.
The Wishart distribution supplies the covariance side of the multivariate sampling theory, so the next step is to combine it with the normal mean vector for formal inference. Hotelling's $T^2$ turns that pairing into a multivariate analogue of the one-sample $t$ test, with ellipsoidal geometry replacing the one-dimensional distance.
# 5. Hotelling's $T^2$ and Inference on the Mean
Hotelling's $T^2$ is the multivariate analogue of the one-sample Student $t$ statistic. The scalar question "is the population mean equal to $\theta_0$?" becomes the vector question "is the mean vector equal to $\mu_0$?", and the sample covariance matrix decides how distances in different directions should be scaled. This chapter assumes the earlier course material on multivariate normal distributions, Wishart covariance matrices, quadratic forms, and the univariate $t$ and $F$ tests. It develops the one-sample and two-sample T^2 tests, derives their exact finite-sample F distributions under multivariate normality, and turns the same quadratic geometry into confidence ellipsoids and simultaneous intervals.
## The One-Sample Mean Problem
How should we test a vector mean when the coordinates may be correlated and measured on different scales? A Euclidean distance such as $|\bar{X}-\mu_0|^2$ treats every direction equally, but multivariate data usually have directions of high and low variance. Hotelling's $T^2$ statistic replaces Euclidean distance by covariance-scaled distance.
[definition: Sample Mean And Sample Covariance]
Let $X_1,\dots,X_n$ be random vectors in $\mathbb R^p$. The sample mean and unbiased sample covariance matrix are
\begin{align*}
\bar{X} &= \frac{1}{n}\sum_{i=1}^n X_i, \\
S &= \frac{1}{n-1}\sum_{i=1}^n (X_i-\bar{X})(X_i-\bar{X})^\top .
\end{align*}
[/definition]
The matrix $S$ estimates the unknown covariance matrix $\Sigma$. When $S$ is invertible, the displacement $\bar{X}-\mu_0$ is measured in units determined by the observed covariance structure. The test statistic therefore has to combine the mean displacement, the inverse covariance estimate, and the sample size into one coordinate-free squared distance.
This construction is the multivariate analogue of standardising a sample mean by its estimated standard error. The next formal block names the resulting quadratic statistic used for the one-sample mean test.
[definition: One-Sample Hotelling Statistic]
Let $X_1,\dots,X_n$ be random vectors in $\mathbb R^p$, let $S$ be the unbiased sample covariance matrix, and let $\mu_0\in\mathbb R^p$. If $S$ is invertible, the one-sample Hotelling statistic for testing $H_0:\mu=\mu_0$ is
\begin{align*}
T^2 = n(\bar{X}-\mu_0)^\top S^{-1}(\bar{X}-\mu_0).
\end{align*}
[/definition]
The factor $n$ appears because the covariance of $\bar{X}$ is $\Sigma/n$, so the statistic compares the observed mean displacement to the sampling variability of the mean rather than to the variability of an individual observation.
[example: Univariate Reduction]
Suppose $p=1$, so $X_1,\dots,X_n$ are real-valued observations with sample variance $s^2$. Then
\begin{align*}
T^2 = n\frac{(\bar{X}-\mu_0)^2}{s^2}=t^2,
\end{align*}
where $t=\sqrt{n}(\bar{X}-\mu_0)/s$ is the usual one-sample Student statistic. Thus Hotelling statistic is not a new test in dimension one; it is the coordinate-free multivariate extension of the same idea.
[/example]
For multivariate normal data the statistic has an exact distribution after rescaling. The result depends on the sample size exceeding the dimension, since $S$ is invertible with probability one only when $n-1\ge p$ under a non-singular normal model.
A test needs more than a large value of $T^2$; it needs a null calibration that turns the statistic into a rejection threshold. Normal sampling is the case where the joint behaviour of $\bar X$ and $S$ is rigid enough to give an exact finite-sample law rather than only an asymptotic approximation.
[quotetheorem:4022]
[citeproof:4022]
This theorem is the practical bridge between multivariate normal sampling theory and hypothesis testing. The normality and positive-definiteness assumptions are doing real work: without them, $\bar X$ and $S$ need not be independent, and the exact $F$ calibration can fail even when the first two moments exist. The condition $n>p$ is also structural, because if $n\le p$ then $S$ is singular with probability one in the non-singular normal model and the inverse covariance scaling is not defined. The result does not say that large $T^2$ identifies which coordinate caused the rejection; it only measures a joint covariance-scaled departure, so later simultaneous intervals are needed for componentwise interpretation.
A level-$\alpha$ test rejects $H_0$ when
\begin{align*}
T^2 > \frac{p(n-1)}{n-p}F_{p,n-p;1-\alpha},
\end{align*}
where $F_{p,n-p;1-\alpha}$ is the $(1-\alpha)$ quantile of the $F_{p,n-p}$ distribution. This same covariance-scaled geometry is the finite-sample version of Mahalanobis distance and is closely related to likelihood ratio tests for multivariate normal means.
## Deriving The F Reduction
Why does an inverse random covariance matrix lead to an $F$ distribution instead of a chi-squared distribution? If $\Sigma$ were known, the statistic $n(\bar X-\mu_0)^\top\Sigma^{-1}(\bar X-\mu_0)$ would have a $\chi^2_p$ distribution. The extra randomness in $S^{-1}$ produces the denominator degrees of freedom.
[quotetheorem:4023]
[citeproof:4023]
The known-covariance theorem is the benchmark that Hotelling's statistic modifies. Its conclusion depends on knowing the exact shape of $\Sigma$; replacing $\Sigma$ by a poor plug-in estimate without changing the reference distribution would understate uncertainty. The theorem does not handle covariance estimation, nor does it protect against misspecification of the covariance metric.
The unknown covariance case replaces $\Sigma$ by $S$, so the reference distribution must account for random scaling as well as random displacement of the sample mean. The key point is not only that $S$ estimates $\Sigma$, but that in the normal model $S$ is independent of $\bar X$ and $(n-1)S$ has the Wishart law $W_p(n-1,\Sigma)$ in the notation of this chapter. These two facts combine to give Hotelling's exact $F$ calibration under the null hypothesis. The independence statement is special to the multivariate normal model; for skewed or heavy-tailed populations, the sample mean and sample covariance can interact, and the exact pivot is lost. The Wishart degrees of freedom also record the cost of centring: the residual covariance is built from only $n-1$ independent residual directions. This calibration does not claim that $S$ is close to $\Sigma$ in every high-dimensional regime; if $p$ is comparable to $n$, invertibility and stability become separate statistical issues.
The $F$ transformation has numerator dimension $p$ because the mean can depart from $\mu_0$ in $p$ independent directions. The denominator degrees of freedom are $n-p$ after the Hotelling pivot is converted to the $F$ scale; this is distinct from the $n-1$ residual degrees of freedom in the Wishart covariance estimator itself.
[example: Iris One-Sample Test]
In the Iris data, take one species, so $n=50$ and $p=4$, with coordinates sepal length, sepal width, petal length, and petal width. Suppose the proposed mean vector is $\mu_0$ and the computed statistic is $T^2=18$. The exact test statistic reported on the $F$ scale is
\begin{align*}
\frac{50-4}{4(50-1)}T^2=\frac{46}{196}\cdot 18\approx 4.22,
\end{align*}
which is compared with $F_{4,46}$, not with a $\chi^2_4$ distribution. The point of the example is diagnostic: the same raw displacement $\bar X-\mu_0$ can look large or small depending on whether it lies along a high-correlation petal direction or along a low-variance contrast direction, and $S^{-1}$ is what makes that distinction.
[/example]
## Comparing Two Multivariate Means
How do we compare the mean vectors of two populations when each observation is multivariate? The univariate two-sample $t$ test pools variance when the two populations have a common variance; Hotelling two-sample statistic does the same with a common covariance matrix.
[definition: Pooled Sample Covariance]
Let $X_1,\dots,X_{n_1}$ be observations in $\mathbb R^p$ from the first population and $Y_1,\dots,Y_{n_2}$ observations in $\mathbb R^p$ from the second population. Let $S_X$ and $S_Y$ be the unbiased sample covariance matrices. The pooled covariance estimate is
\begin{align*}
S_p = \frac{(n_1-1)S_X+(n_2-1)S_Y}{n_1+n_2-2}.
\end{align*}
[/definition]
Pooling is justified only under the model assumption that both populations have the same covariance matrix. The statistic then scales the difference of sample means by the estimated covariance of $\bar X-\bar Y$.
With two independent samples, the natural signal is the vector difference between their empirical centres. To compare this vector difference across correlated coordinates, it must be converted into a single covariance-scaled distance using the pooled covariance estimate.
[definition: Two-Sample Hotelling Statistic]
Let $\bar X$ and $\bar Y$ be the two sample means, and suppose $S_p$ is invertible. The two-sample Hotelling statistic for testing $H_0:\mu_X=\mu_Y$ under a common covariance model is
\begin{align*}
T^2 = \frac{n_1n_2}{n_1+n_2}(\bar X-\bar Y)^\top S_p^{-1}(\bar X-\bar Y).
\end{align*}
[/definition]
The prefactor is the reciprocal of the scalar variance factor $1/n_1+1/n_2$. Thus the statistic is the squared covariance-scaled distance between the two sample means.
To turn this distance into an exact test, the statistic must be related to a distribution whose quantiles are known. Under normality and a common covariance matrix, the two-sample problem has the same kind of finite-sample $F$ calibration as the one-sample problem.
[quotetheorem:4025]
[citeproof:4025]
The equal-covariance assumption is substantial. If the covariance matrices differ, the pooled covariance can overweight the more variable group and distort directions of separation. The theorem also requires enough total residual degrees of freedom to invert $S_p$; otherwise the statistic is undefined without regularisation. It does not justify pooling as a data-driven convenience, since equality of covariance matrices is a modelling assumption rather than a consequence of the formula. If the covariance matrices differ, there is no exact finite-sample pivot with the same simple form, and the problem becomes the multivariate Behrens--Fisher problem.
[remark: Behrens Fisher Difficulty]
For unequal covariance matrices, the natural statistic uses the estimated covariance $S_X/n_1+S_Y/n_2$ of $\bar X-\bar Y$. Its distribution depends on the unknown covariance matrices in a way that cannot be removed by a single scalar degrees-of-freedom correction. Common responses include large-sample chi-squared approximations, bootstrap calibration, permutation methods under exchangeability assumptions, or specialised approximate degrees-of-freedom procedures.
[/remark]
This warning matters in applied multivariate work because covariance structure is often part of the scientific signal. The two-sample Hotelling test is strongest when the groups differ mainly in mean while their scatter is plausibly common. When the groups have different covariance shapes, graphical checks and covariance diagnostics should accompany the test before interpreting a rejection as a pure mean shift.
[example: Comparing Two Iris Species]
To compare versicolor and virginica in the Iris data, use $n_1=n_2=50$ and $p=4$. If the pooled-covariance statistic is $T^2=160$, then the exact transformed statistic is
\begin{align*}
\frac{50+50-4-1}{4(50+50-2)}T^2=\frac{95}{392}\cdot 160\approx 38.78,
\end{align*}
which is compared with $F_{4,95}$. The structural lesson is that this is not four separate univariate tests: a separation in petal length and petal width may mostly occupy one correlated direction, so the pooled inverse covariance measures the joint separation after removing duplicated information. A useful diagnostic after rejection is to inspect the vector $S_p^{-1/2}(\bar X-\bar Y)$, whose large coordinates identify the covariance-standardised contrasts driving the result.
[/example]
## Confidence Ellipsoids And Simultaneous Inference
What does the $T^2$ test say about plausible values of the whole mean vector, rather than just one proposed value? Inverting the acceptance region of the one-sample test gives an ellipsoid centred at $\bar X$. The axes of the ellipsoid are determined by the eigenvectors and eigenvalues of $S$.
[quotetheorem:4026]
[citeproof:4026]
The ellipsoid records joint uncertainty. Its normal and positive-definite covariance assumptions are inherited from the exact Hotelling test; without them, the stated finite-sample coverage need not hold. A Euclidean ball around $\bar X$ would change its inferential meaning if one coordinate were rescaled from centimetres to millimetres, while the ellipsoid is invariant under non-singular linear reparameterisations. The theorem also does not say that each coordinate interval obtained by projecting the ellipsoid is the shortest possible interval for that coordinate; it is a joint confidence statement, and the next results compare different ways to project it.
A long axis corresponds to a direction in which the mean is estimated imprecisely; a short axis corresponds to a direction with smaller sampling variability.
[example: Confidence Ellipse In Two Dimensions]
When $p=2$, diagonalise the sample covariance matrix as $S=Q\Lambda Q^\top$, where $\Lambda=\operatorname{diag}(\lambda_1,\lambda_2)$ and $Q$ is orthogonal. The confidence region consists of points $m$ such that
\begin{align*}
\frac{n y_1^2}{\lambda_1}+\frac{n y_2^2}{\lambda_2}\le c_\alpha,
\end{align*}
where $y=Q^\top(m-\bar X)$ and $c_\alpha=p(n-1)F_{p,n-p;1-\alpha}/(n-p)$. The ellipse is centred at $\bar X$, rotated by the eigenvectors of $S$, and has semi-axis lengths $\sqrt{c_\alpha\lambda_i/n}$.
[/example]
[illustration:confidence-ellipse]
Often the goal is not to report the full ellipsoid but to make statements about many linear combinations $a^\top\mu$ at once. Simultaneous intervals control the probability that all stated intervals are correct in the same sample.
[quotetheorem:4027]
[citeproof:4027]
Scheffe intervals are designed for all linear combinations, so they can be wider than intervals needed for a fixed finite list of coordinates. The phrase "for every $a$" is the key strength and the key cost: choosing $a$ after seeing the data is still covered, but the interval pays for an infinite family of possible directions. If one instead reports separate unadjusted coordinate intervals, the probability that all coordinates are covered can be much less than $1-\alpha$, especially when $p$ is large. Scheffe's result does not identify a preferred scientific contrast; it supplies simultaneous protection once the analyst decides which linear functionals to read from the ellipsoid.
For a planned list of $k$ scalar statements, Bonferroni intervals use univariate $t$ intervals at level $\alpha/k$ and control the family-wise error probability by the union bound.
The motivation changes when the analyst has only a finite set of preselected contrasts rather than all possible directions. In that setting, paying the full Scheffe price can be unnecessarily conservative, especially when $k$ is small compared with the continuum of linear combinations. Bonferroni's method asks for a weaker but often more practical guarantee: simultaneous coverage for a specified list, obtained by allocating the error probability across the listed intervals.
[quotetheorem:4028]
[citeproof:4028]
Bonferroni's theorem needs the contrasts to be fixed before observing the data; if the list is selected after looking at large effects, the stated family-wise guarantee no longer applies. Its advantage is that it does not require considering every possible linear combination, so for a small planned family it may be much narrower than Scheffe's method. The theorem also gives at least $1-\alpha$ coverage rather than exact coverage in general, because the union bound may be conservative when the scalar statistics are correlated. This prepares the final question of the chapter: once a rejection region is fixed, how likely is it to detect a real covariance-scaled departure?
## Power And Non-Centrality
How likely is the $T^2$ test to detect a false null hypothesis? The answer depends on how far $\mu$ is from $\mu_0$ in covariance-scaled distance, not on Euclidean distance alone. This distance becomes the non-centrality parameter of the distribution under alternatives.
[definition: Hotelling Non-Centrality Parameter]
In the one-sample multivariate normal model with true mean $\mu$ and covariance matrix $\Sigma$, the non-centrality parameter for testing $H_0:\mu=\mu_0$ is
\begin{align*}
\lambda = n(\mu-\mu_0)^\top\Sigma^{-1}(\mu-\mu_0).
\end{align*}
[/definition]
Large values of $\lambda$ mean the true mean is far from the null value relative to the sampling covariance of the mean. A small Euclidean shift in a low-variance direction may have higher power than a larger Euclidean shift in a high-variance direction.
Power calculations require the full distribution of the test statistic under a false null, not just a geometric measure of separation. Under a fixed alternative, the same covariance-scaled displacement becomes the parameter that shifts the null calibration into a non-central law.
[quotetheorem:4029]
[citeproof:4029]
Therefore the power of the level-$\alpha$ one-sample test is
\begin{align*}
\mathbb P_\mu\left(F_{p,n-p}(\lambda)>F_{p,n-p;1-\alpha}\right),
\end{align*}
a probability computed under the non-central $F$ distribution. The theorem keeps the same normality, independence, and invertibility requirements as the central distribution result; under non-normal alternatives, $\lambda$ remains a useful signal-to-noise measure but no longer guarantees the exact non-central $F$ law. It also shows what the test cannot detect well: departures lying mainly in high-variance directions can have small non-centrality even when their Euclidean length is visually large. This interpretation links Hotelling's test to power calculations in MANOVA and to confidence ellipsoids in multivariate regression, where the same quadratic form controls detectable effects.
Power increases with $n$, with covariance-scaled separation, and with alternatives aligned to directions of low variance.
[example: Direction Matters For Power]
Suppose $p=2$ and $\Sigma=\operatorname{diag}(1,9)$. A shift $\mu-\mu_0=(1,0)$ gives contribution $1$ to $(\mu-\mu_0)^\top\Sigma^{-1}(\mu-\mu_0)$, while a shift $(0,1)$ gives contribution $1/9$. With the same sample size, the first alternative has a larger non-centrality parameter and hence higher power, because it lies in the lower-variance direction.
[/example]
The central message is that Hotelling T^2 turns multivariate mean inference into geometry with the sample covariance matrix as the metric. The same statistic gives tests, confidence ellipsoids, simultaneous intervals, and power calculations once the Wishart distribution supplies the finite-sample calibration.
Hotelling's $T^2$ shows how covariance geometry governs inference on the mean, but that same covariance matrix also describes directions of variation in the data. Principal component analysis uses it to find low-dimensional summaries that preserve as much total variability as possible.
# 6. Principal Component Analysis
Principal component analysis asks how to replace a high-dimensional random vector by a small number of linear summaries while preserving as much variation as possible. Chapters 1-4 supplied the covariance matrix, multivariate normal sampling theory, and Wishart distribution needed to make this question precise. In this chapter the geometry is spectral: principal directions are eigenvectors of a covariance or correlation matrix, and the corresponding eigenvalues measure the variance carried by those directions.
The statistical issue is twofold. At the population level PCA is an optimisation problem about covariance structure. At the sample level it becomes an estimation problem: eigenvalues, eigenvectors, scores, and biplots are computed from data and then interpreted subject to sampling variability.
## Population Principal Components
Which linear combinations of the coordinates of $X$ have the largest variance? Since a linear combination $a^\top X$ can be made arbitrarily variable by multiplying $a$ by a large constant, the direction vector must be constrained. PCA imposes the normalisation $|a|=1$ and then searches for orthogonal directions that successively maximise variance.
Let $X$ be a random vector in $\mathbb R^p$ with mean $\mu$ and covariance matrix $\Sigma$. Since $\Sigma$ is symmetric and positive semidefinite, the spectral theorem gives an orthonormal basis of eigenvectors.
[definition: Population Spectral Decomposition]
Let $\Sigma \in \mathbb R^{p \times p}$ be symmetric and non-negative definite. A population spectral decomposition of $\Sigma$ is a representation
\begin{align*}
\Sigma = \Gamma \Lambda \Gamma^\top,
\end{align*}
where $\Gamma = (\gamma_1,\dots,\gamma_p)$ is orthogonal, $\Lambda = \operatorname{diag}(\lambda_1,\dots,\lambda_p)$, and
\begin{align*}
\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_p \ge 0.
\end{align*}
[/definition]
The columns $\gamma_k$ give axes in the original coordinate system. The ordering of the eigenvalues is part of the PCA convention: the first direction has the largest variance, the second direction has the next largest variance subject to being orthogonal to the first, and so on.
[definition: Population Principal Component]
Let $X$ be a random vector in $\mathbb R^p$ with mean $\mu$ and covariance matrix $\Sigma = \Gamma\Lambda\Gamma^\top$. The $k$th population principal component is the scalar random variable
\begin{align*}
Y_k = \gamma_k^\top X.
\end{align*}
The vector $Y = \Gamma^\top X$ is the vector of population principal components.
[/definition]
Centering changes the component means but not their variances. Often one writes $\gamma_k^\top(X-\mu)$ when the aim is to describe variation about the mean.
The definition names the component directions, but it remains to justify why eigenvectors are the right directions to choose. PCA is based on an optimisation problem: find a unit vector whose linear score has as much variance as possible, then find further unit directions subject to orthogonality constraints. The theorem connects that optimisation problem to the spectral decomposition of $\Sigma$, turning the geometric search for high-variance directions into an eigenvalue problem.
[quotetheorem:4030]
[citeproof:4030]
The theorem is the conceptual core of PCA: it turns a variance optimisation problem into an eigendecomposition. The unit-length hypothesis is essential, since without it the variance of $a^\top X$ can be made arbitrarily large by replacing $a$ with $ca$ and letting $|c|\to\infty$. The orthogonality constraints are also essential for the later components: if they are omitted, every maximisation problem would again select a leading eigendirection rather than discovering new directions. When eigenvalues are repeated, the individual eigenvectors inside the repeated eigenspace are not identifiable; for instance, if $\Sigma=I_2$, every unit vector in $\mathbb R^2$ is a first principal direction. Thus the theorem identifies ordered eigenspaces rather than scientifically meaningful signed axes in every case, which is why the next example interprets PCA geometrically through covariance ellipses rather than through coordinates alone.
[example: Two Dimensional Normal Geometry]
Let $X\sim \mathcal N(\mu,\Sigma)$ in $\mathbb R^2$ with
\begin{align*}
\Sigma = \begin{pmatrix}\sigma_1^2 & \rho\sigma_1\sigma_2 \\ \rho\sigma_1\sigma_2 & \sigma_2^2\end{pmatrix},
\qquad -1<\rho<1.
\end{align*}
The density contours are ellipses centred at $\mu$, and the principal component directions are the major and minor axes of these ellipses. Solving the quadratic eigenvalue equation gives two eigenvalues
\begin{align*}
\lambda_{1,2}=\frac{\sigma_1^2+\sigma_2^2}{2}\pm \frac{1}{2}\sqrt{(\sigma_1^2-\sigma_2^2)^2+4\rho^2\sigma_1^2\sigma_2^2}.
\end{align*}
For the first eigenvector angle $\theta$ measured from the first coordinate axis, a direct diagonalisation gives
\begin{align*}
\tan(2\theta)=\frac{2\rho\sigma_1\sigma_2}{\sigma_1^2-\sigma_2^2},
\end{align*}
with the quadrant chosen so that the larger eigenvalue is selected. Thus PCA rotates the coordinate system until the covariance ellipse has no cross-term.
[/example]
## Variance Explained and Dimension Reduction
How much information is lost when only the first few principal components are retained? PCA answers this through variance, not through prediction error or likelihood by default. The eigenvalues form a variance budget, and choosing components means deciding how much of this budget to keep.
[quotetheorem:4031]
[citeproof:4031]
The principal components are uncorrelated by construction because the covariance matrix has been diagonalised, but this conclusion is only second-order. If $X$ is multivariate normal, then the components are also independent, because uncorrelated normal components are independent. Outside the multivariate normal case, uncorrelatedness need not imply independence: for example, if $U\sim\operatorname{Unif}(-1,1)$ and $V=U^2$, then $\operatorname{Cov}(U,V)=0$ while $V$ is determined by $U$. Thus PCA removes linear covariance, not all forms of dependence, and the variance-explained calculations that follow should be read as statements about second moments.
[definition: Proportion of Variance Explained]
Let $\lambda_1\ge\cdots\ge\lambda_p\ge0$ be the eigenvalues of a covariance matrix $\Sigma$ with $\operatorname{tr}(\Sigma)>0$. The proportion of variance explained by the $k$th principal component is
\begin{align*}
\frac{\lambda_k}{\sum_{j=1}^p\lambda_j}.
\end{align*}
The cumulative proportion of variance explained by the first $m$ components is
\begin{align*}
\frac{\sum_{k=1}^m\lambda_k}{\sum_{j=1}^p\lambda_j}.
\end{align*}
[/definition]
The denominator is the total marginal variance, since
\begin{align*}
\operatorname{tr}(\Sigma)=\sum_{j=1}^p\operatorname{Var}(X_j)=\sum_{j=1}^p\lambda_j.
\end{align*}
A scree plot displays $k\mapsto \lambda_k$ or the sample analogue $k\mapsto \hat\lambda_k$; a sharp bend suggests that later components mostly capture small residual variation. This raises the problem of whether retaining the first few spectral directions is also optimal for reconstructing the whole centred data matrix.
The variance-maximisation description of PCA should agree with the practical task of approximating a centred data matrix by a lower-rank matrix. This agreement is not automatic: the remaining problem is to prove that the same singular directions solve the Frobenius-norm reconstruction problem.
[quotetheorem:4032]
[citeproof:4032]
This theorem explains why PCA is a least-squares dimension reduction method for centred data matrices. The rank constraint is essential: without it, $B=A$ gives zero reconstruction error and there is no dimension reduction problem. The Frobenius norm is also part of the statement; changing the loss, adding sparsity constraints, or requiring interpretability of variables can lead to different low-dimensional summaries. A concrete limitation is that two small singular directions may contain a scientifically important rare signal even though the Frobenius-optimal rank-one approximation discards them. For ordinary centred PCA, however, the theorem links the spectral construction above to the singular value computation used for data matrices, which is the bridge to covariance and correlation PCA.
## Covariance PCA and Correlation PCA
Should PCA be applied to the covariance matrix or to the correlation matrix? The answer depends on the units of measurement. If variables are measured on comparable scales, covariance PCA preserves the original variance structure; if not, variables with large units can dominate the first components.
[definition: Correlation Matrix]
Let $X$ have covariance matrix $\Sigma$ with positive diagonal entries $\sigma_{11},\dots,\sigma_{pp}$. Let
\begin{align*}
D=\operatorname{diag}(\sigma_{11}^{1/2},\dots,\sigma_{pp}^{1/2}).
\end{align*}
The correlation matrix of $X$ is
\begin{align*}
\rho = D^{-1}\Sigma D^{-1}.
\end{align*}
[/definition]
PCA of $\rho$ is the same as PCA of the standardised variables $Z=D^{-1}(X-\mu)$. It therefore gives each original coordinate variance one before forming components. The remaining question is which coordinate changes preserve covariance PCA itself, before arbitrary rescaling changes the variance ordering.
Before comparing covariance and correlation PCA, it is useful to separate genuine coordinate rotations from arbitrary rescalings of variables. Orthogonal coordinate changes preserve lengths and angles, so we need covariance PCA to transform equivariantly under them if the spectral interpretation is intrinsic.
[quotetheorem:4033]
[citeproof:4033]
This equivariance is special to orthogonal changes of coordinates, because orthogonal maps preserve Euclidean lengths, angles, and the variance-maximisation constraint $|a|=1$. The hypothesis that $Q$ is orthogonal cannot be replaced by an arbitrary invertible matrix: if $\Sigma=I_2$ and the first coordinate is rescaled by $10$, the transformed covariance matrix becomes $\operatorname{diag}(100,1)$ and the first coordinate is now distinguished. Thus covariance PCA is coordinate-scale dependent even though it is rotation and reflection equivariant. This limitation is exactly what motivates comparing covariance PCA with correlation PCA in the next example.
[example: Standardisation Changes the First Component]
Consider two measurements $X_1$ and $X_2$ with covariance matrix
\begin{align*}
\Sigma=\begin{pmatrix}100&0\\0&1\end{pmatrix}.
\end{align*}
Covariance PCA chooses the first coordinate as the first principal component, with proportion of variance explained
\begin{align*}
\frac{100}{101}.
\end{align*}
The correlation matrix is $I_2$, so after standardisation every unit direction has the same variance and there is no distinguished first component. This example shows that standardisation is not a cosmetic preprocessing step; it changes the optimisation problem.
[/example]
Standardise when the variables are in different units, when relative rather than absolute variation is the target, or when no variable should dominate merely because of its scale. Do not standardise automatically when the magnitude of variation is substantively meaningful, as with measurements recorded in a common physical unit.
## Sample Principal Components
In applications $\Sigma$ is unknown, so PCA is computed from data. The sample version replaces the population covariance matrix by the sample covariance matrix and then repeats the same spectral construction. This gives estimated directions, component scores for each observation, and reconstructed low-dimensional approximations.
Let $x_1,\dots,x_n\in\mathbb R^p$ be observations, and write
\begin{align*}
\bar{x}=\frac{1}{n}\sum_{i=1}^n x_i,
\qquad
S=\frac{1}{n-1}\sum_{i=1}^n (x_i-\bar{x})(x_i-\bar{x})^\top.
\end{align*}
[definition: Sample Principal Components]
Let
\begin{align*}
S=\hat\Gamma\hat\Lambda\hat\Gamma^\top,
\qquad
\hat\Lambda=\operatorname{diag}(\hat\lambda_1,\dots,\hat\lambda_p),
\qquad
\hat\lambda_1\ge\cdots\ge\hat\lambda_p\ge0.
\end{align*}
The columns $\hat\gamma_1,\dots,\hat\gamma_p$ of $\hat\Gamma$ are the sample loading vectors. The $k$th sample score for observation $i$ is
\begin{align*}
\hat y_{ik}=\hat\gamma_k^\top(x_i-\bar{x}).
\end{align*}
[/definition]
The loadings describe directions in variable space, while the scores describe where observations sit along those directions. This distinction becomes important in biplots and in scientific interpretation.
Since sample PCA is used to infer population directions, we need to know when empirical eigenvalues and loadings stabilise as more observations are collected. The following consistency result gives the fixed-dimension justification for reading sample components as estimates of population components.
[quotetheorem:4034]
[citeproof:4034]
The fixed-dimension assumption is doing real work here: entrywise convergence controls the whole matrix only when $p$ is not growing with $n$. The finite second moment assumption is also necessary for the sample covariance to settle down by the law of large numbers; heavy-tailed data with infinite variance can make the empirical eigenvalues unstable. Simplicity of $\lambda_k$ is needed for convergence of a particular eigenvector, since for $\Sigma=I_2$ any orthonormal basis diagonalises the covariance and no single population direction is distinguished. Even when the eigenvalue is simple, the sign convention matters because eigenvectors are only defined up to multiplication by $-1$: changing the sign reverses all scores on that component but leaves fitted subspaces, variances, and reconstructions unchanged.
[example: Gene Expression Matrix]
Let $A\in\mathbb R^{n\times p}$ record centred expression levels for $p$ genes across $n$ tissue samples, with $p$ potentially much larger than $n$. Sample PCA can be computed from the singular value decomposition $A=UDV^\top$ without forming the large $p\times p$ matrix $S$. The columns of $V$ are gene loading directions, while the columns of $UD$ give sample scores. A two-component score plot can reveal clustering by tissue type, and the largest positive and negative entries of a loading vector identify genes contributing most strongly to that component.
[/example]
## Asymptotic Distribution of Sample Eigenvalues
Consistency says that sample eigenvalues converge, but it does not quantify their sampling fluctuation. In fixed dimension, multivariate normal sampling theory gives a central limit theorem for the eigenvalues of $S$. The simplest form is obtained when the population eigenvalues are distinct.
[quotetheorem:4035]
[citeproof:4035]
The theorem is often used to attach approximate uncertainty to variance-explained estimates, but each hypothesis marks a boundary of the approximation. Distinctness is needed because the ordered eigenvalue map is not differentiable in the same way at a tied block; for example, if $\Sigma=I_2$, the two sample eigenvalues are ordered random perturbations of a repeated eigenvalue rather than independent differentiable coordinates. Positivity excludes degenerate directions, where a zero population eigenvalue can force constrained or nonstandard limits. Normality gives the Wishart covariance structure used in the variance $2\lambda_k^2$; for non-normal data, fourth moments enter the limiting variance. Fixed dimension is also essential, since high-dimensional regimes with $p/n$ not small have biased and spread-out sample eigenvalues governed by different random matrix limits.
[example: Approximate Standard Error of a Leading Eigenvalue]
Suppose normal data have a well-separated leading population eigenvalue $\lambda_1$ and the sample estimate is $\hat\lambda_1$. Anderson theorem suggests the approximation
\begin{align*}
\hat\lambda_1 \approx \mathcal N\left(\lambda_1,\frac{2\lambda_1^2}{n}\right).
\end{align*}
Replacing $\lambda_1$ by $\hat\lambda_1$ gives the estimated standard error
\begin{align*}
\frac{\sqrt{2}\hat\lambda_1}{\sqrt{n}}.
\end{align*}
This calculation is most reliable in fixed dimension with a large sample size and a leading eigenvalue separated from the rest.
[/example]
## Choosing the Number of Components
How many principal components should be retained? There is no universal rule because PCA itself does not specify a downstream loss function. The common rules are heuristics unless they are tied to a modelling or prediction goal.
[definition: Kaiser Rule]
For PCA based on a correlation matrix, the Kaiser rule retains components with sample eigenvalues greater than $1$.
[/definition]
The rule is motivated by the fact that the average eigenvalue of a $p\times p$ correlation matrix is $1$. A retained component then explains more variance than an average standardised original variable.
A different heuristic starts from a target amount of retained variation rather than from a comparison with one average variable. This leads to a threshold rule based on cumulative explained variance.
[definition: Cumulative Variance Threshold]
Given a threshold $\alpha\in(0,1)$, the cumulative variance rule chooses the smallest $m$ such that
\begin{align*}
\frac{\sum_{k=1}^m\hat\lambda_k}{\sum_{j=1}^p\hat\lambda_j}\ge \alpha.
\end{align*}
[/definition]
Typical thresholds such as $0.80$, $0.90$, or $0.95$ are context-dependent. They are easier to justify when the goal is compression than when the goal is inference about scientifically meaningful latent structure.
When the PCA representation is chosen for prediction or reconstruction, variance explained is only a proxy for the actual objective. A task-based choice instead estimates which dimension performs best on data not used to fit the components.
This requires a separate definition because the dimension is no longer selected from the eigenvalue table alone. The formal object records the validation procedure and the loss whose future performance is being estimated.
[definition: Cross Validated PCA Dimension]
A cross validated PCA dimension is a value $m$ chosen to minimise an estimated out-of-sample reconstruction or prediction loss computed by fitting PCA on training data and evaluating the loss on held-out data.
[/definition]
Cross-validation forces the dimension choice to be tied to a specific task. For missing-value reconstruction, held-out entries may be masked; for prediction, the score representation can be fed into a supervised model and evaluated on held-out responses.
[example: Comparing Dimension Rules]
Suppose correlation PCA gives sample eigenvalues
\begin{align*}
3.2,1.4,0.9,0.3,0.2.
\end{align*}
The Kaiser rule keeps two components, since only $3.2$ and $1.4$ exceed $1$. The cumulative variance explained by two components is
\begin{align*}
\frac{3.2+1.4}{6.0}\approx0.767,
\end{align*}
while three components explain
\begin{align*}
\frac{5.5}{6.0}\approx0.917.
\end{align*}
A $90\%$ cumulative threshold would therefore keep three components, illustrating that common rules can disagree.
[/example]
## Loadings, Scores, and Biplots
Once PCA has been computed, how should the output be read? The two main objects are loadings and scores. Loadings describe variables, scores describe observations, and biplots display both in a common low-dimensional coordinate system.
[definition: Loadings and Scores]
For sample PCA with loading vectors $\hat\gamma_k$, the loading of variable $j$ on component $k$ is the $j$th coordinate $(\hat\gamma_k)_j$. The score of observation $i$ on component $k$ is
\begin{align*}
\hat y_{ik}=\hat\gamma_k^\top(x_i-\bar{x}).
\end{align*}
[/definition]
Large positive or negative loadings indicate variables that contribute strongly to a component. Since the sign of a component is arbitrary, interpretation should focus on relative signs and magnitudes within the same component, not on the absolute sign of a single loading.
A table of scores describes observations and a table of loadings describes variables, but neither table alone shows how the two descriptions interact. To compare clusters of observations with the variables that drive the displayed directions, both objects need a common geometric display. The biplot formalises that combined view.
[definition: PCA Biplot]
A PCA biplot is a two-dimensional display, usually using the first two principal components, that plots observation scores together with variable vectors derived from the corresponding loadings.
[/definition]
In a score plot, nearby observations have similar coordinates in the retained component space. In a loading plot, variables pointing in similar directions tend to be positively associated in the reduced representation, while variables pointing in opposite directions tend to be negatively associated.
[example: Reading a Two Component Biplot]
Suppose the first component of a standardised data set has positive loadings of comparable size on variables measuring body size, while the second component contrasts a positive loading on endurance with a negative loading on resting heart rate. An observation far to the right has high overall size scores, while an observation high on the second axis has the endurance-versus-heart-rate contrast in the positive direction. If two variable arrows have a small angle between them, their projected scores tend to move together in the two-component approximation.
[/example]
The main caution is that a biplot is only a projection. Distances, angles, and clusters are trustworthy only to the extent that the first two components explain the relevant structure. A biplot should therefore be read together with the scree plot and the cumulative variance table.
PCA compresses variation through orthogonal linear combinations, but it does not claim that the observed variables are driven by hidden causes. Factor analysis takes the next step by asking whether a smaller set of latent variables, together with specific noise, can explain the covariance structure more directly.
# 7. Factor Analysis
Factor analysis starts from a question that PCA deliberately avoids: can the covariance matrix of many observed variables be explained by a smaller number of unobserved causes plus variable-specific noise? Chapter 6 used principal components as orthogonal directions that summarise variance, but it did not assert a probability model for how the observations were generated. This chapter introduces the Gaussian factor model, studies the non-uniqueness of its parameters, and explains how rotations and factor scores are used to obtain interpretable low-dimensional structure. The guiding theme is that covariance structure can be explained only after we separate shared variation from coordinate-specific variation.
## The Gaussian Factor Model
What kind of probabilistic model says that many measured variables share a few common sources of variation? The factor model writes each observed vector as a linear function of latent factors plus residual variation that is specific to each coordinate.
[definition: Gaussian Factor Model]
Let $X$ be a centred random vector in $\mathbb R^p$. A $k$-factor Gaussian factor model for $X$ consists of a loading matrix $\Lambda \in \mathbb R^{p \times k}$, a latent random vector $F \sim \mathcal N(0,I_k)$, and an error vector $\epsilon \sim \mathcal N_p(0,\Psi)$ such that
\begin{align*}
X = \Lambda F + \epsilon,
\end{align*}
where $F$ and $\epsilon$ are independent and $\Psi = \operatorname{diag}(\psi_1,\dots,\psi_p)$ with $\psi_i > 0$.
[/definition]
The entries of $F$ are the common factors. The diagonal form of $\Psi$ encodes the modelling assumption that all cross-covariance between observed variables is explained by the shared factors, rather than by correlated residual errors.
The next issue is what covariance structure this modelling equation actually imposes. Before estimating or interpreting factors, we need the exact population covariance decomposition forced by independence and diagonal specific variance.
[quotetheorem:4036]
[citeproof:4036]
This theorem is the mathematical core of factor analysis. The independence hypothesis is doing real work: if $F$ and $\epsilon$ were correlated, mixed covariance terms such as $\operatorname{Cov}(\Lambda F,\epsilon)$ would appear and the decomposition would no longer be simply low-rank plus diagonal. The diagonal hypothesis on $\Psi$ is also essential; with correlated residual errors, off-diagonal covariance could be hidden in the noise rather than explained by common factors. The theorem does not say that every covariance matrix has a unique decomposition of this form, and the next section studies the first source of non-uniqueness: rotation of the latent factor axes.
[example: Residual Correlation Breaks The Factor Interpretation]
Suppose $p=4$ and $k=2$, with the first two observed variables loading mainly on the first factor and the last two loading mainly on the second. A matrix such as
\begin{align*}
\Lambda =
\begin{pmatrix}
0.8 & 0.1\\
0.7 & 0.2\\
0.1 & 0.6\\
0.2 & 0.7
\end{pmatrix}
\end{align*}
produces large covariance between variables $1$ and $2$, large covariance between variables $3$ and $4$, and smaller covariance across the two pairs. This example teaches the diagnostic use of off-diagonal residuals: if the empirical covariance between variables $1$ and $3$ remains large after subtracting $\Lambda\Lambda^\top$, then the diagonal-noise hypothesis is failing rather than merely the numerical loading values. Adding a positive diagonal matrix $\Psi$ changes the variances but cannot repair such an off-diagonal residual, so this pattern separates a genuine two-factor explanation from unexplained correlated noise.
[/example]
## Rotational Indeterminacy and Identification
If the covariance depends on $\Lambda$ only through $\Lambda\Lambda^\top$, can the loading matrix itself be recovered? The answer is no without further constraints, because the axes of the latent factor space can be rotated without changing the distribution of $X$.
[quotetheorem:4037]
[citeproof:4037]
Rotational indeterminacy is not a defect in computation; it is a structural non-identifiability of the model. Orthogonality is the hypothesis that preserves the convention $\operatorname{Cov}(F)=I_k$; if $T$ were not orthogonal, then $TT^\top$ would replace $I_k$ and the common-factor covariance would usually change under the same normalisation. For example, scaling one factor axis by $2$ and leaving the factor covariance fixed would multiply the corresponding rank-one contribution by $4$, so the covariance would not be preserved. The theorem does not say that all rotations are equally interpretable, only that the likelihood and covariance cannot choose between orthogonal coordinate systems. This motivates explicit identification conventions and later interpretive rotations.
[definition: Identification Constraint]
An identification constraint for the factor model is an additional condition imposed on $\Lambda$ and $\Psi$ so that a representative loading matrix is selected from the class of observationally equivalent loadings.
[/definition]
Common constraints include triangular restrictions on $\Lambda$, sign conventions on selected loadings, or likelihood-based normalisations. In maximum likelihood estimation for the Gaussian factor model, a standard identification condition is that $\hat\Lambda^\top\hat\Psi^{-1}\hat\Lambda$ is diagonal. The question is whether this condition is an arbitrary reporting convention or a constraint naturally selected by the likelihood equations.
A useful identification convention should not be merely cosmetic; it should be compatible with the likelihood equations at an interior optimum. We need to know whether a Gaussian maximum likelihood fit forces the weighted cross-products of the selected loading columns into the diagonal form used to choose a representative among rotationally equivalent solutions.
[quotetheorem:4038]
[citeproof:4038]
The positivity of $\hat\Psi$ matters because $\hat\Psi^{-1}$ must exist; if a uniqueness estimate were zero, the displayed weighted cross-product would not even be defined. The orthogonal-factor convention also matters, since arbitrary non-orthogonal transformations would alter the normalisation of the latent covariance unless the factor covariance were changed at the same time. The theorem does not identify the true loading matrix from data, and it does not choose the most interpretable axes. It only supplies a conventional MLE representative, which later rotations such as Varimax may intentionally abandon to obtain a simpler loading pattern.
## Communalities and Specific Variances
How much of the variance of a measured variable is shared with the common factors? The factor model separates the variance of each coordinate into a common part, called communality, and a residual part, called specific variance.
[definition: Communality]
For a loading matrix $\Lambda=(\lambda_{ij})\in\mathbb R^{p\times k}$, the communality of variable $i$ is
\begin{align*}
h_i^2 = \sum_{j=1}^k \lambda_{ij}^2.
\end{align*}
[/definition]
Communality only records the part of a variable explained by the common factors. To decide whether a variable is weakly explained because it is mostly noise rather than because all variables have small loadings, we also need the residual variance left on that coordinate.
[definition: Specific Variance]
In the Gaussian factor model with $\Psi=\operatorname{diag}(\psi_1,\dots,\psi_p)$, the specific variance of variable $i$ is $\psi_i$.
[/definition]
Without the communality-specific variance split, a large marginal variance can be misread as evidence that a variable is central to the factor structure. A variable may have high total variance because it is strongly tied to common factors, or because it is mostly noisy and idiosyncratic. The diagonal entries of $\Sigma=\Lambda\Lambda^\top+\Psi$ distinguish these cases:
\begin{align*}
\operatorname{Var}(X_i)=h_i^2+\psi_i.
\end{align*}
Thus $h_i^2$ measures the part of the variance attributed to the common factors, while $\psi_i$ measures variation unique to the variable and measurement noise.
[example: High Variance Need Not Mean High Communality]
Compare two standardised variables in a fitted two-factor model. For the first row $(0.8,0.1)$, the communality is $h_1^2=0.8^2+0.1^2=0.65$, so if $\psi_1=0.35$ then $65\%$ of the variance is explained by the common factors. For another variable with loading row $(0.2,0.1)$ and $\psi_2=0.95$, the total variance is still $1.00$ after rescaling, but its communality is only $0.05$. This example gives a reusable diagnostic: inspect communalities before interpreting high-variance variables as substantively important factors.
[/example]
A high communality means that the factor model accounts for most of the variation in that coordinate. A high specific variance signals either measurement noise, a variable-specific source of variation, or a poor fit of the chosen number of factors.
## Estimating the Factor Model
Given a sample covariance matrix $S$, how should $\Lambda$ and $\Psi$ be estimated? The two classical approaches are the principal factor method, which is computational and spectral, and maximum likelihood, which treats the Gaussian factor model as a full statistical model.
[definition: Principal Factor Method]
The principal factor method estimates the common covariance part by replacing the diagonal of the sample covariance or correlation matrix with initial communality estimates, extracting the leading $k$ eigencomponents, and iterating the diagonal communalities if desired.
[/definition]
The method is close in spirit to PCA but changes the diagonal before extracting eigenvectors. This reflects the factor-analysis goal: explain common covariance, not total variance.
[example: Principal Factor Starting Point]
For standardised variables, the sample correlation matrix has diagonal entries equal to $1$. A principal factor analysis may replace the $i$th diagonal entry by an initial estimate such as the squared multiple correlation of variable $i$ against the remaining variables. The leading two eigencomponents of this adjusted matrix then give a first two-factor loading matrix whose row norms estimate communalities.
[/example]
Maximum likelihood instead chooses parameters that make the observed sample covariance most probable under the Gaussian model. For observations $X_1,\dots,X_n \overset{\text{i.i.d.}}{\sim} \mathcal N_p(0,\Sigma)$, the log-likelihood depends on $\Lambda$ and $\Psi$ through $\Sigma=\Lambda\Lambda^\top+\Psi$.
To compare candidate loading and uniqueness matrices, the estimation problem needs a single objective function on the admissible parameter space. The log-likelihood definition supplies that objective and makes explicit which covariance matrices are allowed.
[definition: Factor Analysis Log-Likelihood]
For sample covariance matrix $S\in\mathbb R^{p\times p}$, the Gaussian factor analysis log-likelihood is the map
\begin{align*}
\ell: \mathcal P_k &\to \mathbb R,\\
(\Lambda,\Psi) &\mapsto -\frac{n}{2}\left\{\log\det(\Lambda\Lambda^\top+\Psi)+\operatorname{tr}\left(S(\Lambda\Lambda^\top+\Psi)^{-1}\right)\right\},
\end{align*}
up to constants not depending on the parameters, where $\mathcal P_k$ is the set of pairs $(\Lambda,\Psi)$ with $\Lambda\in\mathbb R^{p\times k}$, $\Psi$ positive diagonal, and $\Lambda\Lambda^\top+\Psi$ positive definite.
[/definition]
This likelihood is usually optimised numerically. The EM algorithm treats the latent factors $F_1,\dots,F_n$ as missing data: the E-step computes their conditional moments given the observations and current parameters, and the M-step updates $\Lambda$ and $\Psi$ using least-squares-type formulae.
For this algorithm to be well-defined, the missing-factor conditional distribution must be computable from the fitted Gaussian blocks. The E-step needs conditional first and second moments of the latent factors, so the Gaussian conditioning formula becomes the computational engine of the algorithm.
[quotetheorem:4039]
[citeproof:4039]
The positive-definiteness hypothesis is needed because the conditioning formula uses $\Sigma^{-1}$; if $\Sigma$ were singular, some linear combination of observed variables would have zero variance and the conditional density calculation would break down. The Gaussian hypothesis is also essential for these formulae to describe the full conditional distribution; with non-Gaussian latent factors, the same covariance blocks would determine only the best linear predictor, not necessarily $\mathbb E[F\mid X=x]$. The theorem does not make the latent factors observed, and it does not remove rotational dependence from the scores. It prepares the EM algorithm by identifying exactly which conditional first and second moments must be averaged in the E-step.
The number of factors $k$ can be assessed by comparing nested covariance models. Under regularity conditions and away from boundary cases, the likelihood ratio statistic compares the fitted $k$-factor model with the unrestricted covariance model. This comparison asks whether the restrictions imposed by a small number of factors leave a detectable lack of fit in the covariance matrix. The theorem below gives the large-sample calibration for that lack-of-fit statistic and makes explicit which dimension count becomes the degrees of freedom.
[quotetheorem:4040]
[citeproof:4040]
The positive-definiteness assumptions are not cosmetic: if $S$ is singular, the saturated Gaussian likelihood is not represented by the displayed finite determinant expression. The interior and non-boundary assumptions are also needed; for example, a Heywood case with an estimated uniqueness at zero can give a nonstandard limiting distribution. The theorem does not say that the model with the first nonsignificant test is scientifically correct, because factor models are often deliberate approximations. It is best used alongside residual correlations, interpretability, and stability under rotation, which are the practical checks developed next.
## Rotating Factors for Interpretability
Once a factor subspace has been fitted, which axes should be used to describe it? Rotation methods choose new factor axes to make the loading matrix easier to read, often by encouraging each variable to load strongly on only a small number of factors.
[definition: Orthogonal Rotation]
An orthogonal rotation replaces $\Lambda$ by $\Lambda T$, where $T\in\mathbb R^{k\times k}$ satisfies $T^\top T=I_k$.
[/definition]
Orthogonal rotations preserve the fitted covariance matrix and keep the factors uncorrelated. They change the visual and substantive interpretation of the columns of $\Lambda$.
Preservation alone does not say which rotation is easier to read. A rotation criterion is needed to turn the qualitative goal of simple structure into a numerical optimisation problem.
[definition: Varimax Criterion]
For fixed $p,k\in\mathbb N$, the Varimax criterion is the map
\begin{align*}
V: \mathbb R^{p\times k} &\to \mathbb R,\\
A=(a_{ij}) &\mapsto \sum_{j=1}^k\left\{\frac{1}{p}\sum_{i=1}^p a_{ij}^4-\left(\frac{1}{p}\sum_{i=1}^p a_{ij}^2\right)^2\right\}.
\end{align*}
Varimax rotation chooses an orthogonal rotation $T\in\mathbb R^{k\times k}$ that increases or maximises $V(\Lambda T)$.
[/definition]
Without a rotation criterion, the same fitted factor subspace can be presented in a diffuse coordinate system where most variables load moderately on most factors. The expression is a sum of column-wise variances of squared loadings. Large values occur when, within each factor column, some variables have large squared loadings and many have small squared loadings.
[example: Varimax Separates A Diffuse Equivalent Solution]
A two-factor solution for mental ability tests may initially have many tests loading moderately on both factors, making neither column a clean verbal or spatial factor. Varimax demonstrates a reusable technique: search within the same orthogonal equivalence class for axes that concentrate squared loadings. After rotation, spatial tests may load strongly on one factor while verbal tests load strongly on another, with smaller cross-loadings. The fitted covariance matrix is unchanged, so the example isolates interpretability of coordinates from statistical fit of the covariance model.
[/example]
Orthogonal rotation may still be too restrictive when the scientific factors are expected to be associated. We need a rotation method that pursues simple structure while allowing correlated factor axes, so the transformation has to leave the orthogonal equivalence class.
[definition: Promax Rotation]
Promax rotation is an oblique rotation method that first constructs a simple target pattern, often from a Varimax solution, and then allows a non-orthogonal transformation of the factor axes toward that target.
[/definition]
Oblique rotations allow the resulting factors to be correlated. This can be scientifically natural, for instance when verbal and spatial abilities are distinct but positively associated.
## Factor Scores
After estimating a factor model, how can latent factor values be assigned to individual observations? A common mistake is to treat the fitted latent factors as if they had been measured directly, but many different latent values can be compatible with the same noisy observation. Since the factors are unobserved random variables, factor scores are predictions or estimates, not observed data.
[definition: Regression Factor Score]
Under a fitted Gaussian factor model with $\hat\Sigma=\hat\Lambda\hat\Lambda^\top+\hat\Psi$ positive definite, the regression factor score is the map
\begin{align*}
\hat F_{\mathrm{reg}}: \mathbb R^p &\to \mathbb R^k,\\
x &\mapsto \hat\Lambda^\top\hat\Sigma^{-1}x.
\end{align*}
[/definition]
This is the conditional expectation of $F$ given $X=x$ under the fitted model. It minimises mean squared prediction error among all measurable predictors when the fitted Gaussian model is treated as correct.
A different scoring rule is useful when the measurement equation is treated as a weighted regression problem rather than as a posterior mean calculation. That perspective leads to a score that emphasises variables with smaller specific variances.
[definition: Bartlett Factor Score]
For a fitted factor model with $\hat\Psi$ positive diagonal and $\hat\Lambda^\top\hat\Psi^{-1}\hat\Lambda$ invertible, the Bartlett factor score is the map
\begin{align*}
\hat F_{\mathrm{Bart}}: \mathbb R^p &\to \mathbb R^k,\\
x &\mapsto \left(\hat\Lambda^\top\hat\Psi^{-1}\hat\Lambda\right)^{-1}\hat\Lambda^\top\hat\Psi^{-1}x.
\end{align*}
[/definition]
The Bartlett method estimates factors by weighted least squares in the measurement equation $x=\Lambda f+\epsilon$, using the inverse specific variances as weights. Variables with large specific variances receive less influence in the score.
[example: Regression Versus Bartlett Scores]
If one test has high uniqueness $\psi_i$, the Bartlett score downweights its residual-heavy measurement when solving for the latent factor vector. The regression score also shrinks scores toward zero through the prior covariance $\operatorname{Cov}(F)=I_k$. In small samples, the two scoring rules can rank individuals similarly while giving different score scales.
[/example]
Factor scores should not be confused with the latent factors themselves. They depend on the fitted loading matrix, the chosen rotation, and the scoring rule.
## Factor Analysis and Principal Component Analysis
Why introduce factor analysis when PCA already gives a low-dimensional representation? The difference is that PCA is a descriptive spectral decomposition of total variance, whereas factor analysis is a statistical model for covariance structure.
[definition: Principal Component Representation]
For a covariance matrix $\Sigma$, principal component analysis represents variation through eigenvectors of $\Sigma$ and orders the components by their eigenvalues.
[/definition]
PCA has no diagonal uniqueness term and no latent error model. This matters when the residual covariance is scientifically meaningful: a rank-$k$ PCA approximation can leave structured off-diagonal residuals, while a well-fitting factor model should leave mostly diagonal residual variation. If the first $k$ components are retained, the discarded components are not interpreted as variable-specific noise; they are simply directions of smaller variance.
The distinction becomes clearest when both methods are described as covariance decompositions. A factor-analysis explanation requires the unexplained covariance to be variable-specific, so the residual term must be diagonal rather than an arbitrary leftover low-rank error.
[quotetheorem:4041]
[citeproof:4041]
The diagonal residual hypothesis is what makes the contrast substantive. If the residual after a low-rank approximation has large off-diagonal entries, PCA may still be a good compression method, but it has not produced a factor-analysis explanation of common covariance. Conversely, a factor model with diagonal $\Psi$ may fit less total variance than PCA while giving a more interpretable account of shared causes. This distinction connects factor analysis to latent variable modelling, covariance geometry, identifiability in statistical models, and graphical models where conditional independence assumptions are encoded through covariance structure.
[example: Holzinger Swineford Mental Ability Data]
In the Holzinger--Swineford mental ability data, tests of related abilities are expected to share latent sources of variation. The structural comparison is between two explanations of the same covariance matrix: PCA treats every retained direction as part of total variance, while a two-factor analysis separates shared verbal or spatial variation from test-specific uniqueness. A useful workflow is to fit the factor model, inspect residual correlations, and then compare the unrotated and Varimax-rotated loading matrices. If rotation preserves fit while stabilising verbal and spatial labels, the example shows why factor analysis is a latent covariance model rather than merely a dimension-reduction display.
[/example]
The practical workflow is therefore: choose a plausible number of factors, estimate the model, inspect communalities and uniquenesses, rotate for interpretability, compute factor scores only if individual-level summaries are needed, and compare the result with PCA as a descriptive benchmark rather than as the same model in another form.
Factor analysis shifts attention from variance summarization to latent structure, and that makes it natural to ask how two multivariate sets relate to each other. Canonical correlation analysis answers that question by finding paired linear combinations, one from each side, that are maximally associated.
# 8. Canonical Correlation Analysis
Canonical correlation analysis asks how strongly two random vectors can be linearly related after we are allowed to choose the best linear summary of each vector. Chapters 6 and 7 studied covariance structure inside one vector, through principal components and factor models. Here the object is cross-covariance: given two blocks of variables, we seek paired directions whose projected scores have maximal correlation.
The method is geometric. Each side supplies a family of linear variates, and CCA finds the two one-dimensional subspaces, one in each block, that are closest in correlation. After the first pair is found, the next pairs are constrained to be uncorrelated with previous ones, producing a sequence of canonical variates and canonical correlations.
## Partitioned Random Vectors and Cross-Covariance
What information about the joint distribution of two vector-valued measurements is relevant if we only care about linear relationships between the two blocks? The answer is contained in the two marginal covariance matrices and the cross-covariance matrix.
[definition: Partitioned Covariance Matrix]
Let $X = (X_1^\top, X_2^\top)^\top$ be a random vector with $X_1 \in \mathbb R^p$ and $X_2 \in \mathbb R^q$. Suppose $\mathbb E[X] = \mu$ and $\operatorname{Cov}(X)$ exists. Write
\begin{align*}
\mu &= (\mu_1^\top, \mu_2^\top)^\top,\\
\Sigma &= \operatorname{Cov}(X)
= \begin{pmatrix}
\Sigma_{11} & \Sigma_{12}\\
\Sigma_{21} & \Sigma_{22}
\end{pmatrix},
\end{align*}
where $\Sigma_{11}=\operatorname{Cov}(X_1)$, $\Sigma_{22}=\operatorname{Cov}(X_2)$, $\Sigma_{12}=\operatorname{Cov}(X_1,X_2)$, and $\Sigma_{21}=\Sigma_{12}^\top$.
[/definition]
The matrices $\Sigma_{11}$ and $\Sigma_{22}$ measure variation within each block, while $\Sigma_{12}$ measures linear association across blocks. Throughout this chapter, unless stated otherwise, assume $\Sigma_{11}$ and $\Sigma_{22}$ are positive definite so that every nonzero linear combination has positive variance.
[example: Cross-Covariance Can Hide in Linear Combinations]
Let $p=q=2$, suppose $\Sigma_{11}=I_2$ and $\Sigma_{22}=I_2$, and take
\begin{align*}
\Sigma_{12}=
\begin{pmatrix}
1/2 & 1/2\\
1/2 & 1/2
\end{pmatrix}.
\end{align*}
Each individual cross-covariance entry is only $1/2$, but the linear combinations $a=(1,1)^\top/\sqrt{2}$ and $b=(1,1)^\top/\sqrt{2}$ satisfy $a^\top\Sigma_{12}b=1$. Thus the first canonical correlation is $1$, showing that perfect linear association may be spread across several measured variables rather than appearing in a single coordinate pair.
[/example]
The example shows why covariance alone is not enough: scaling a test changes covariance but not correlation. Canonical correlation therefore normalises the projected variates to have unit variance.
## Population Canonical Correlations
Which pair of linear combinations has the largest possible correlation? Since correlation is unchanged by rescaling either linear combination, the optimisation can be written with variance constraints.
[definition: First Population Canonical Correlation]
The first population canonical correlation between $X_1$ and $X_2$ is
\begin{align*}
\rho_1
= \max_{a \in \mathbb R^p,\, b \in \mathbb R^q}
\operatorname{Corr}(a^\top X_1,b^\top X_2)
\end{align*}
subject to
\begin{align*}
a^\top \Sigma_{11}a = 1,
\qquad
b^\top \Sigma_{22}b = 1.
\end{align*}
[/definition]
With the variance constraints imposed, the objective becomes $a^\top \Sigma_{12}b$. The problem is therefore a constrained bilinear maximisation problem, and whitening each block turns it into an ordinary singular-value problem.
[quotetheorem:4042]
[citeproof:4042]
This theorem is the central reduction of CCA: the statistical problem is converted into the spectral analysis of a whitened cross-covariance matrix. The positive-definiteness hypotheses are doing real work. They ensure that $\Sigma_{11}^{-1/2}$ and $\Sigma_{22}^{-1/2}$ exist and that the constraints $a^\top\Sigma_{11}a=1$ and $b^\top\Sigma_{22}b=1$ define genuine unit-variance normalisations rather than degenerating along directions of zero variance.
The theorem identifies the possible correlations, but it does not by itself make the directions statistically stable or uniquely interpretable. Those issues depend on the singular-value structure and, in samples, on the conditioning of the covariance blocks. This is why the population spectral formula leads naturally to two later questions: whether the directions are identifiable, and how the same construction behaves when $\Sigma_{ij}$ is replaced by $S_{ij}$.
[example: Singular Block Failure]
Suppose $X_1=(Z,Z)^\top$ with $\operatorname{Var}(Z)>0$. Then
\begin{align*}
\Sigma_{11}=\operatorname{Var}(Z)
\begin{pmatrix}
1&1\\
1&1
\end{pmatrix}
\end{align*}
is singular, and the nonzero vector $a=(1,-1)^\top$ has $a^\top X_1=0$ a.s. The constraint $a^\top\Sigma_{11}a=1$ cannot be imposed in this null direction, and $\Sigma_{11}^{-1/2}$ is not defined as an ordinary inverse square root. This example shows that singular covariance blocks are not a harmless technicality; they collapse distinct coefficient vectors to the same random variate.
[/example]
With the spectral reduction in place, the next issue is identifiability. The correlations may be well-defined even when the corresponding directions are not uniquely determined.
[quotetheorem:4043]
[citeproof:4043]
Existence of the correlations and identifiability of the directions are separate matters. Positive definiteness gives a compact whitened optimisation problem, so the ordered singular values always exist. Distinctness is only an identifiability condition for the associated one-dimensional directions.
When singular values repeat, the corresponding canonical subspace is identifiable, but a particular basis inside that subspace is not determined by the population covariance matrix alone. For example, if $p=q=2$, $\Sigma_{11}=\Sigma_{22}=I_2$, and $\Sigma_{12}=rI_2$ with $0<r<1$, then both canonical correlations equal $r$. Any common orthonormal basis of $\mathbb R^2$ gives valid canonical pairs, so the two-dimensional canonical subspace is determined but the individual labelled directions are not.
## Canonical Variates and Orthogonality
After finding the strongest pair of linear combinations, how do we organise the rest of the cross-block association without repeating the same signal? The answer is to choose subsequent variates that are uncorrelated within each block with the previous ones.
[definition: Population Canonical Variates]
Let $X_1:\Omega\to\mathbb R^p$ and $X_2:\Omega\to\mathbb R^q$ be the two coordinate blocks of $X$, and let $a_k \in \mathbb R^p$ and $b_k \in \mathbb R^q$ be canonical direction vectors for the $k$th canonical correlation. The corresponding population canonical variates are the real-valued random variables $U_k:\Omega\to\mathbb R$ and $V_k:\Omega\to\mathbb R$ defined by
\begin{align*}
U_k(\omega) = a_k^\top X_1(\omega),
\qquad
V_k(\omega) = b_k^\top X_2(\omega).
\end{align*}
[/definition]
The normalisation convention is that each canonical variate has variance one. This makes the canonical correlations directly comparable across $k$. Once the variates are standardised, the important structural question is whether later pairs really remove the association already captured by earlier pairs. The desired result is a diagonal covariance pattern: different canonical variates within the same block should be uncorrelated, and cross-block covariance should survive only in matched pairs.
[quotetheorem:4044]
[citeproof:4044]
The variates therefore diagonalise the linear dependence between the two blocks: after transformation, the only nonzero cross-covariances are the paired ones. This statement depends on using the covariance inner products supplied by positive definite $\Sigma_{11}$ and $\Sigma_{22}$. If either block has a null direction, the corresponding linear combination has zero variance and cannot be converted into a unit-variance canonical variate without first reducing dimension or regularising.
The theorem is also a second reminder that CCA is a second-moment method. Orthogonality here means zero covariance, not statistical independence unless additional assumptions such as joint normality are imposed. The construction removes repeated linear covariance signals, but it does not remove nonlinear association between later canonical variates.
[example: Orthogonality Does Not Mean Independence]
Let $Z\sim\mathcal N(0,1)$ and let $W=Z^2-1$. Then $\operatorname{Cov}(Z,W)=\mathbb E[Z^3]-\mathbb E[Z]\mathbb E[W]=0$, but $W$ is determined by $Z^2$ and is not independent of $Z$. This example shows how a CCA pair can be uncorrelated with another linear score while nonlinear dependence remains outside the canonical covariance diagonalisation.
[/example]
## Sample Canonical Correlation Analysis
How do we estimate the population canonical correlations from data? The population covariance blocks are replaced by sample covariance blocks, and the same whitening-and-SVD construction is applied.
[definition: Sample Covariance Blocks]
Given observations $x_1,\dots,x_n$ of $X$, with $x_i=(x_{i1}^\top,x_{i2}^\top)^\top$, let
\begin{align*}
S=
\begin{pmatrix}
S_{11} & S_{12}\\
S_{21} & S_{22}
\end{pmatrix}
\end{align*}
be the sample covariance matrix partitioned according to $X_1\in\mathbb R^p$ and $X_2\in\mathbb R^q$.
[/definition]
The sample procedure is meaningful when $S_{11}$ and $S_{22}$ are invertible. In practice this usually requires enough observations relative to $p$ and $q$, or else regularised versions of CCA are used.
[definition: Sample Canonical Correlations]
Assume $S_{11}$ and $S_{22}$ are positive definite. The sample canonical correlations $\hat\rho_1\ge\cdots\ge\hat\rho_s\ge0$, where $s=\min(p,q)$, are the singular values of
\begin{align*}
\hat C = S_{11}^{-1/2}S_{12}S_{22}^{-1/2}.
\end{align*}
If $\hat p_k\in\mathbb R^p$ and $\hat r_k\in\mathbb R^q$ are corresponding left and right singular vectors, the sample canonical direction maps on the two observation spaces are the linear maps
\begin{align*}
x_1 &\mapsto \hat a_k^\top x_1, & \hat a_k&=S_{11}^{-1/2}\hat p_k,\\
x_2 &\mapsto \hat b_k^\top x_2, & \hat b_k&=S_{22}^{-1/2}\hat r_k.
\end{align*}
[/definition]
This definition exactly parallels the population construction. The numerical core of sample CCA is an SVD, which is more stable than forming inverse matrices and solving a nonsymmetric eigenvalue problem.
[example: High-Dimensional Sample Inflation]
For a dataset of $n$ students, form the centred data matrices $Y_1\in\mathbb R^{n\times p}$ and $Y_2\in\mathbb R^{n\times q}$. Then $S_{11}=(n-1)^{-1}Y_1^\top Y_1$, $S_{22}=(n-1)^{-1}Y_2^\top Y_2$, and $S_{12}=(n-1)^{-1}Y_1^\top Y_2$. If $p\ge n-1$, then $S_{11}$ is singular because the centred matrix $Y_1$ has rank at most $n-1$; similarly for $S_{22}$ when $q\ge n-1$. Thus ordinary sample CCA can fail before any statistical testing begins, and even near this boundary the largest sample canonical correlations can be artificially large because many directions are searched relative to the sample size.
[/example]
## Testing Cross-Block Independence
When should the observed canonical correlations be regarded as evidence of genuine association rather than sampling variation? Under multivariate normality, Wilks' likelihood ratio statistic gives a classical global test of zero cross-covariance.
[quotetheorem:4045]
[citeproof:4045]
The statistic is small when at least one sample canonical correlation is large. Thus Wilks' test aggregates evidence across all canonical dimensions rather than focusing only on the first pair. Its interpretation is model-dependent: the likelihood-ratio derivation uses multivariate normality, and the displayed formula assumes that the sample covariance blocks used in CCA are invertible.
Wilks' statistic tests whether the population cross-covariance matrix is zero under that model. It does not prove causal dependence, does not identify which variables are scientifically responsible for the association, and does not rule out nonlinear dependence when the test is not significant. In finite samples, a small value of $\Lambda$ should therefore be read as evidence against zero linear cross-covariance under the stated normal model, not as a complete description of the relationship between the two blocks.
[example: Zero Cross-Covariance with Nonlinear Dependence]
Let $Z\sim\mathcal N(0,1)$, take $X_1=Z$, and take $X_2=Z^2$. Then $X_1$ and $X_2$ are dependent, but
\begin{align*}
\operatorname{Cov}(X_1,X_2)=\mathbb E[Z^3]-\mathbb E[Z]\mathbb E[Z^2]=0.
\end{align*}
A CCA-based test of zero cross-covariance is therefore not a general test of independence outside settings where the covariance structure fully controls dependence, such as the multivariate normal model.
[/example]
Wilks' statistic gives an exact likelihood-ratio object, but a usable test still needs a reference distribution. In practice the exact finite-sample law is difficult outside special low-dimensional cases, so CCA often uses Bartlett's large-sample correction. The approximation below explains how the product of the sample canonical-correlation factors is converted into a chi-squared statistic for testing zero population cross-covariance.
[quotetheorem:4046]
[citeproof:4046]
The approximation is most reliable when the sample size is large compared with $p+q$ and the covariance blocks are well-conditioned. The chi-squared limit keeps $p$ and $q$ fixed while $n\to\infty$; it is not a guarantee for regimes where the number of variables grows with the sample size.
A concrete failure mode occurs when $p+q$ is close to $n$. The sample covariance matrix then has limited rank, the whitening matrices become unstable or undefined, and maximising over many directions can produce large $\hat\rho_k$ even when $\Sigma_{12}=0$. In such a regime, Bartlett's correction may still output a number with a formal $\chi^2_{pq}$ reference distribution, but that reference distribution no longer accurately describes the sampling behaviour of the statistic.
## Connections with PCA and Regression
How does CCA relate to the other multivariate methods in the course? It can be viewed as a common framework for extracting linear structure, with PCA and multiple regression appearing as limiting or special cases.
[explanation: CCA and PCA]
Principal component analysis finds directions in a single vector $X$ that maximise variance subject to orthogonality. CCA instead finds paired directions in two vectors that maximise correlation after within-block standardisation. If one formally compares $X$ with itself, then the whitened cross-covariance becomes the identity on the nondegenerate part of the covariance space, so all canonical correlations equal one. PCA is therefore not obtained by the same objective with two independent blocks; rather, it appears as the one-block variance-maximisation analogue of the CCA geometry.
[/explanation]
With the local setup in place, the remaining issue is how the two views are related. The next explanation, CCA and Multiple Regression, uses that setup to move the discussion forward without changing topics abruptly.
[explanation: CCA and Multiple Regression]
When $q=1$, write $X_2=Y$ for a scalar response. There is only one nonzero canonical correlation, and maximising $\operatorname{Corr}(a^\top X_1,Y)$ over $a$ gives the multiple correlation coefficient between $Y$ and the predictors $X_1$. The canonical direction in the $X_1$ block is proportional to the population least-squares coefficient vector $\Sigma_{11}^{-1}\Sigma_{12}$. Thus CCA extends multiple regression from one response to a vector of responses.
[/explanation]
These connections explain why CCA is most useful when both sides are genuinely multivariate. It gives a symmetric analysis of association: neither block is singled out as the sole response, and the result is a ranked sequence of paired linear summaries.
CCA produces paired summaries of two variable blocks, but classification asks a different question: how should an observation be assigned to a known group? Linear discriminant analysis uses the same covariance ideas to construct separating rules and to measure how well they distinguish classes.
# 9. Linear Discriminant Analysis
Linear discriminant analysis asks how to classify an observation whose features are multivariate when the class labels are known for a training sample. The chapter assumes familiarity with the multivariate normal model from Chapter 2, covariance and Wishart sampling theory from Chapters 3-4, quadratic forms, eigenvalues, and elementary maximum-likelihood plug-in estimation. The central idea is to replace a high-dimensional classification problem by a few carefully chosen linear scores that separate class means while controlling within-class variation. This chapter connects the Fisher geometric construction with the Bayes rule for multivariate normal populations, then turns to sample estimates, error assessment, and regularisation when the covariance estimate is unstable.
## Fisher Two-Group Linear Discriminant
Suppose observations come from one of two populations and each observation is a vector in $\mathbb R^p$. The first question is how to choose a single numerical score $a^\top x$ that preserves as much class separation as possible.
[definition: Two-Group Linear Score]
Let $X$ be an $\mathbb R^p$-valued random vector belonging to one of two groups with mean vectors $\mu_1,\mu_2\in\mathbb R^p$. For a vector $a\in\mathbb R^p\setminus\{0\}$, the associated linear score is the map $s_a:\mathbb R^p\to\mathbb R$ defined by
\begin{align*}
s_a(x)=a^\top x.
\end{align*}
[/definition]
The two projected group means are $a^\top\mu_1$ and $a^\top\mu_2$. If the groups share a covariance matrix $\Sigma$, the projected within-group variance is $a^\top\Sigma a$, so the natural signal-to-noise ratio is the squared distance between projected means divided by projected variance.
Choosing $a$ by inspection is unreliable because a direction can create a large raw mean difference simply by also creating large within-class spread. The projection problem needs an objective that rewards separation only after normalising by that projected noise. The Fisher criterion names this objective function.
[definition: Fisher Criterion]
Let $\Sigma\in\mathbb R^{p\times p}$ be positive definite and let $\mu_1\ne\mu_2$. The Fisher criterion is the map $J:\mathbb R^p\setminus\{0\}\to\mathbb R$ defined by
\begin{align*}
J(a)=\frac{\{a^\top(\mu_1-\mu_2)\}^2}{a^\top\Sigma a}.
\end{align*}
[/definition]
The criterion is unchanged if $a$ is multiplied by a non-zero scalar. Thus the Fisher problem is a problem about directions in $\mathbb R^p$, not lengths.
[quotetheorem:4047]
[citeproof:4047]
The positive-definiteness of $\Sigma$ is essential here: it makes $a^\top\Sigma a$ strictly positive for every non-zero $a$ and makes $\Sigma^{-1}$ well-defined. If $\Sigma$ were singular, a direction with zero within-class variance could make the ratio undefined or infinite, and the geometric maximisation would need a separate constrained or regularised treatment. The theorem also does not choose a class label by itself; it identifies only the best projection direction. The resulting score is the Fisher linear discriminant:
\begin{align*}
\delta(x)=(\mu_1-\mu_2)^\top\Sigma^{-1}x.
\end{align*}
A threshold must still be chosen; the variance-ratio argument gives the direction, while probabilistic modelling supplies the classification rule.
[example: Two Gaussian Clouds]
Consider two groups in $\mathbb R^2$ with common covariance
\begin{align*}
\Sigma=\begin{pmatrix}4&1\\1&1\end{pmatrix},\qquad \mu_1=(1,1)^\top,\qquad \mu_2=(-1,0)^\top.
\end{align*}
Here $d=(2,1)^\top$ and a direct matrix calculation gives
\begin{align*}
\Sigma^{-1}d=\frac{1}{3}\begin{pmatrix}1&-1\\-1&4\end{pmatrix}\begin{pmatrix}2\\1\end{pmatrix}=\begin{pmatrix}1/3\\2/3\end{pmatrix}.
\end{align*}
Thus the score weights the second coordinate twice as heavily as the first, even though the raw mean difference is larger in the first coordinate. The covariance structure matters because the first coordinate has high variance and is correlated with the second.
[/example]
## Bayes Classification Under Equal Covariances
The next question is whether the geometrically chosen direction agrees with the classifier that minimises the probability of assigning a new observation to the wrong population. Under multivariate normal class-conditional distributions with equal covariance matrices, the answer is yes.
[definition: Bayes Classifier]
Let $Y\in\{1,\dots,g\}$ be a class label with prior probabilities $\pi_k=\mathbb P(Y=k)$, and let $X\in\mathbb R^p$ have class-conditional densities $f_k$. A Bayes classifier is a map $c_B:\mathbb R^p\to\{1,\dots,g\}$ such that
\begin{align*}
c_B(x)\in\operatorname{argmax}_{1\le k\le g}\, \pi_k f_k(x).
\end{align*}
[/definition]
The Bayes classifier minimises the conditional probability of error at each observed value $x$. In the two-group normal model, comparing the two posterior numerators is the same as comparing their logarithms.
The remaining question is whether that logarithmic comparison produces a linear boundary or something more complicated. Under equal covariance matrices, the normal-density algebra collapses to a linear discriminant rule.
[quotetheorem:4048]
[citeproof:4048]
The equal-covariance hypothesis is the reason the quadratic terms in $x$ cancel when the two log densities are compared. If the covariance matrices differ, terms such as $x^\top\Sigma_1^{-1}x$ and $x^\top\Sigma_2^{-1}x$ remain, so the decision boundary is generally quadratic rather than linear. Positive-definiteness is also needed because the normal density and the inverse covariance matrix must be well-defined. This theorem is therefore the probabilistic justification for LDA under a specific model: the Fisher construction determines the same normal vector to the separating hyperplane, while the Bayes calculation determines the intercept.
[example: Equal Priors And Equal Covariance]
Let $X\mid Y=k\sim\mathcal N_2(\mu_k,I_2)$ with equal priors, where $\mu_1=(1,0)^\top$ and $\mu_2=(-1,0)^\top$. The rule assigns $x=(x_1,x_2)^\top$ to group $1$ when $2x_1>0$, so the decision boundary is the vertical line $x_1=0$. The second coordinate is ignored by the classifier because it has the same distribution in both groups.
[/example]
## Quadratic Discriminant Analysis
The equal-covariance assumption is restrictive. If the class covariance matrices differ, the Bayes comparison retains its quadratic terms, so the separating surface need not be a hyperplane.
[definition: Quadratic Discriminant Score]
For class $k$, suppose $X\mid Y=k\sim\mathcal N_p(\mu_k,\Sigma_k)$, where $\Sigma_k$ is positive definite and the prior probability is $\pi_k>0$. The quadratic discriminant score is the map $q_k:\mathbb R^p\to\mathbb R$ defined by
\begin{align*}
q_k(x)= -\frac{1}{2}\log\det\Sigma_k-\frac{1}{2}(x-\mu_k)^\top\Sigma_k^{-1}(x-\mu_k)+\log\pi_k.
\end{align*}
[/definition]
QDA assigns $x$ to the class with largest $q_k(x)$. The phrase quadratic discriminant analysis reflects the fact that the difference $q_j(x)-q_k(x)$ is generally a quadratic polynomial in the coordinates of $x$.
Since QDA is introduced by its score functions, we still need to connect those scores to optimal classification. The theorem records that choosing the largest quadratic score is exactly the Bayes rule for the unequal-covariance Gaussian model.
[quotetheorem:4049]
[citeproof:4049]
The positive-definiteness of each $\Sigma_k$ is needed for both the determinant and inverse in the score. The theorem does not say that the resulting classifier is easier to estimate from data; it is an optimal population rule under a richer normal model. A concrete failure of the LDA simplification occurs when two groups have the same mean but different variances: LDA has no mean separation to use, while QDA can classify using distance from the common centre because the quadratic and determinant terms differ. QDA is therefore more flexible than LDA but estimates many more covariance parameters. In $p$ dimensions with $g$ groups, QDA estimates $g$ covariance matrices, while LDA estimates one pooled covariance matrix.
[example: LDA And QDA Boundaries]
In two dimensions, take equal priors and class means $\mu_1=(-1,0)^\top$, $\mu_2=(1,0)^\top$. If both classes have covariance $I_2$, the LDA and QDA boundaries both reduce to the line $x_1=0$. If instead $\Sigma_1=I_2$ and $\Sigma_2=\operatorname{diag}(4,1)$, then the QDA comparison includes $x_1^2$ terms with different coefficients, so the boundary becomes a conic section. This example shows why QDA can follow curved separation patterns that LDA cannot represent.
[/example]
[illustration:lda-qda-boundaries]
## Multi-Group Linear Discriminant Analysis
With more than two groups, a single projection may not capture all relevant separation. The problem becomes finding a low-dimensional subspace in which group centres are well separated relative to within-group scatter.
[definition: Between-Group And Within-Group Matrices]
Let there be $g$ groups in $\mathbb R^p$, with group means $\mu_1,\dots,\mu_g$, common covariance matrix $\Sigma$, and prior weights $\pi_1,\dots,\pi_g$. Let $\bar\mu=\sum_{k=1}^g \pi_k\mu_k$. The population between-group matrix is
\begin{align*}
B=\sum_{k=1}^g \pi_k(\mu_k-\bar\mu)(\mu_k-\bar\mu)^\top,
\end{align*}
and the population within-group matrix is
\begin{align*}
W=\Sigma.
\end{align*}
[/definition]
The matrix $B$ records how the class centres vary around the overall mean, while $W$ records the common within-class covariance. Since $B$ is built from $g$ centred mean vectors whose weighted sum is zero, its rank is at most $g-1$.
Once separation and noise are encoded by matrices, the scalar two-group Fisher ratio becomes a family of possible discriminant directions. The next question is how many meaningful directions can exist and how they are found from $B$ and $W$. This is the eigenvalue problem behind multi-group LDA.
[quotetheorem:4050]
[citeproof:4050]
The positive-definiteness of $W$ is what turns the ratio $v^\top Bv/v^\top Wv$ into a well-posed Rayleigh quotient and allows whitening by $W^{-1/2}$. If $W$ is singular, some directions may have zero estimated within-group variance, and the generalised eigenproblem may have unstable or non-unique solutions. The theorem also does not imply that all $p$ coordinates are useful for discrimination; the number of non-zero directions is limited by the number of groups as well as by the ambient dimension. In practice these directions are used both for classification and for visualisation. A $g$-group LDA fit can be plotted in the first two discriminant coordinates when at least two non-zero discriminant directions exist.
[example: Iris Species Classification]
The iris data have $g=3$ species and $p=4$ measurements: sepal length, sepal width, petal length, and petal width. Multi-group LDA can produce at most $\min(3-1,4)=2$ discriminant functions. The first discriminant coordinate separates setosa from the other species because its petal measurements are far from the remaining two groups, while the second coordinate mainly separates versicolor and virginica. This is a standard example where the dimension reduction is also a useful visual summary.
[/example]
## Sample LDA And Plug-In Classification
The population quantities in LDA are unknown, so data analysis replaces them with sample estimates. The important statistical question is what is being estimated and how the estimates determine the decision boundary.
[definition: Sample Group Means And Pooled Covariance]
Suppose group $k$ has observations $x_{k1},\dots,x_{kn_k}\in\mathbb R^p$, and let $n=\sum_{k=1}^g n_k$. The sample mean in group $k$ is
\begin{align*}
\hat\mu_k=\frac{1}{n_k}\sum_{i=1}^{n_k}x_{ki}.
\end{align*}
The sample covariance in group $k$ is
\begin{align*}
S_k=\frac{1}{n_k-1}\sum_{i=1}^{n_k}(x_{ki}-\hat\mu_k)(x_{ki}-\hat\mu_k)^\top.
\end{align*}
The pooled covariance matrix is
\begin{align*}
S_{\mathrm{pooled}}=\frac{1}{n-g}\sum_{k=1}^g (n_k-1)S_k.
\end{align*}
[/definition]
The plug-in LDA rule substitutes $\hat\mu_k$, $S_{\mathrm{pooled}}$, and estimated priors into the normal equal-covariance Bayes rule. If empirical priors are used, then $\hat\pi_k=n_k/n$.
For classification, these fitted quantities are still not directly comparable across groups: each candidate class needs a single discriminant value that combines location, covariance scaling, and prior probability on the same scale. The next definition packages the fitted normal rule into those comparable scores, so the classifier can be written as a simple maximisation problem.
[definition: Sample LDA Score]
For group $k$, the sample LDA score is the map $\hat\ell_k:\mathbb R^p\to\mathbb R$ defined by
\begin{align*}
\hat\ell_k(x)=x^\top S_{\mathrm{pooled}}^{-1}\hat\mu_k-
\frac{1}{2}\hat\mu_k^\top S_{\mathrm{pooled}}^{-1}\hat\mu_k+\log\hat\pi_k.
\end{align*}
The sample LDA classifier is the map $\hat c:\mathbb R^p\to\{1,\dots,g\}$ satisfying
\begin{align*}
\hat c(x)\in\operatorname{argmax}_{1\le k\le g}\,\hat\ell_k(x).
\end{align*}
[/definition]
The boundary between groups $j$ and $k$ is obtained by setting $\hat\ell_j(x)=\hat\ell_k(x)$. All quadratic terms in $x$ have disappeared because a common covariance matrix is used, so this boundary is a hyperplane.
The formula suggests linear decision boundaries, but the claim depends on the pooled covariance being invertible and on the common-covariance form of the scores. A precise statement is needed to separate this geometric conclusion from cases where the fitted covariance is singular or group-specific. The following result records the linear-boundary property of sample LDA.
[quotetheorem:4051]
[citeproof:4051]
The invertibility assumption is not a technical decoration: without it, $S_{\mathrm{pooled}}^{-1}$ does not exist and the displayed score is not defined. This commonly happens when $p\ge n-g$, because the pooled sample covariance then cannot have full rank. A singular pooled covariance also means that some directions have no estimated within-class variation, so the fitted boundary can depend sensitively on regularisation or on a chosen generalised inverse. For $g>2$, linear pairwise boundaries do not mean the full classifier has a single separating hyperplane; the classification regions are intersections of several half-spaces and can form a multi-cell polyhedral partition.
[example: Two-Species Plug-In Boundary]
Suppose two fitted groups have empirical priors $1/2$, sample means $\hat\mu_1=(0,0)^\top$, $\hat\mu_2=(2,0)^\top$, and pooled covariance $S_{\mathrm{pooled}}=\operatorname{diag}(1,4)$. The boundary equation becomes $x_1=1$. The second coordinate has larger variance but no mean difference, so it does not enter the separating hyperplane.
[/example]
## Error Rates And Mahalanobis Distance
A fitted classifier must be assessed on data. The apparent error rate is fast to compute but is optimistic because each observation helps determine the rule that later classifies it.
[definition: Apparent Error Rate]
For a fitted classifier $\hat c:\mathbb R^p\to\{1,\dots,g\}$ trained on labelled observations $(x_i,y_i)_{i=1}^n$, the apparent error rate is
\begin{align*}
\operatorname{AER}=\frac{1}{n}\sum_{i=1}^n \mathbb{1}_{\{\hat c(x_i)\ne y_i\}}.
\end{align*}
[/definition]
The apparent error rate uses the same data for fitting and checking, so it can understate the error a new observation would face. A more local diagnostic asks each training point to be classified by a rule that did not use that point in its own fitting. Leave-one-out cross-validation implements this idea while still using almost all the data for each refit.
[definition: Leave-One-Out Error Rate]
Let $\hat c^{(-i)}$ be the classifier trained on all labelled observations except $(x_i,y_i)$. The leave-one-out error rate is
\begin{align*}
\operatorname{LOO}=\frac{1}{n}\sum_{i=1}^n \mathbb{1}_{\{\hat c^{(-i)}(x_i)\ne y_i\}}.
\end{align*}
[/definition]
For two normal groups with equal covariance and equal priors, the separation between the populations is measured by the squared Mahalanobis distance
\begin{align*}
\Delta^2=(\mu_1-\mu_2)^\top\Sigma^{-1}(\mu_1-\mu_2).
\end{align*}
The larger $\Delta$ is, the smaller the Bayes error is.
To make that relationship usable, the qualitative statement has to be converted into an error probability. Under the equal-prior normal model, the optimal rule reduces to a one-dimensional normal threshold after projecting along the discriminant direction. The resulting formula links Bayes error directly to $\Delta$.
[quotetheorem:4052]
[citeproof:4052]
The equal-prior and equal-covariance assumptions are both doing work in this formula. Unequal priors shift the threshold and make the two conditional error probabilities different, while unequal covariances return to the QDA setting where a single projected normal calculation is no longer the whole story. The formula also gives the population Bayes error, not the error of a sample LDA classifier with estimated parameters. This distinction explains why LDA is tied to Mahalanobis distance rather than Euclidean distance: separation along high-variance directions counts less than the same raw separation along low-variance directions.
[example: Mahalanobis Separation]
For two equal-prior normal groups with common covariance $I_p$, the squared separation is $\Delta^2=|\mu_1-\mu_2|^2$. If the covariance is instead $\operatorname{diag}(9,1)$ and the mean difference is $(3,0)^\top$, then $\Delta^2=1$, giving Bayes error $\Phi(-1/2)$. The Euclidean distance is $3$, but the separation is less decisive because it lies entirely in a high-variance direction.
[/example]
## Regularised Discriminant Analysis
Classical LDA depends on inverting the pooled covariance matrix. When $p$ is large compared with the sample size, this matrix may be singular or unstable, and QDA is even more demanding because it estimates a separate covariance for each group.
[definition: Regularised Discriminant Covariance]
For a tuning parameter $\alpha\in[0,1]$, group covariance estimate $S_k$, and pooled covariance estimate $S_{\mathrm{pooled}}$, define
\begin{align*}
\hat\Sigma_k(\alpha)=\alpha S_k+(1-\alpha)S_{\mathrm{pooled}}.
\end{align*}
[/definition]
At $\alpha=0$, all groups use the pooled covariance estimate, giving LDA. At $\alpha=1$, each group uses its own covariance estimate, giving QDA. Intermediate values shrink group-specific covariance matrices toward the pooled estimate.
A covariance estimate alone does not classify a new observation; it must be inserted into a discriminant score that also accounts for the fitted mean, determinant penalty, and prior weight. Because the covariance now depends on both the group and the tuning parameter, the corresponding score is quadratic in $x$ and varies with $\alpha$. The regularised score makes this fitted rule explicit.
[definition: Regularised Discriminant Score]
Given $\hat\Sigma_k(\alpha)$ and prior estimate $\hat\pi_k$, the regularised discriminant score is the map $\hat q_{k,\alpha}:\mathbb R^p\to\mathbb R$ defined by
\begin{align*}
\hat q_{k,\alpha}(x)= -\frac{1}{2}\log\det\hat\Sigma_k(\alpha)-\frac{1}{2}(x-\hat\mu_k)^\top\hat\Sigma_k(\alpha)^{-1}(x-\hat\mu_k)+\log\hat\pi_k.
\end{align*}
[/definition]
The tuning parameter is usually selected by cross-validation. In high-dimensional settings, further shrinkage toward a diagonal or scalar covariance matrix is often added, but the basic principle remains the same: trade bias for lower estimation variance.
[example: High-Dimensional Classification]
Suppose $p=100$ and there are only $20$ observations in each of three groups. The individual covariance matrices $S_k$ are singular because $n_k-1<p$, so unregularised QDA cannot invert them. If $S_{\mathrm{pooled}}$ is better conditioned or is itself combined with a diagonal shrinkage target, then regularised discriminant analysis gives usable scores. Cross-validation can compare values of $\alpha$ by estimating the out-of-sample misclassification rate.
[/example]
## Summary Of The LDA Framework
The chapter has moved from the Fisher projection problem to Bayes classification and then to sample implementation. The Fisher theorem says that the best one-dimensional separating direction is $\Sigma^{-1}(\mu_1-\mu_2)$, while the normal equal-covariance Bayes rule says that the full classifier has linear boundaries with the same normal direction. QDA replaces linear boundaries by quadratic ones when class covariances differ, and multi-group LDA uses the eigenstructure of $W^{-1}B$ to find up to $\min(g-1,p)$ discriminant coordinates. In applications, apparent and leave-one-out error rates assess fitted rules, Mahalanobis distance controls ideal two-group error, and regularisation stabilises covariance estimation when dimension is large relative to sample size.
LDA turns multivariate geometry into a classification rule, and the next natural extension is to compare groups on several responses at once. MANOVA generalizes ANOVA by testing whether group mean vectors differ, using the same covariance machinery in a multivariate setting.
# 10. Multivariate Analysis of Variance (MANOVA)
This chapter assumes the preceding material on univariate ANOVA, covariance matrices, multivariate normal samples from Chapters 2-4, Hotelling-style mean inference from Chapter 5, and likelihood-ratio testing. The goal is to understand how ANOVA changes when each experimental unit produces several correlated responses rather than a single number. The main difficulty is diagnostic: separate univariate ANOVAs can miss a difference that only appears along a linear combination of responses, while a naive scalar summary of the covariance matrix can hide which directions matter.
Multivariate analysis of variance asks the same question as ordinary analysis of variance, but with a vector response. Instead of comparing group means for one measured quantity, we compare mean vectors for several related measurements at once. The central difficulty is that the variables may be correlated, so a collection of separate univariate tests does not capture the geometry of the joint response. This chapter builds MANOVA from the one-way model, derives the decomposition of total variation into between-group and within-group matrices, and studies the four standard likelihood-root statistics. The final sections explain how the same matrix ideas extend to post-hoc inference, robustness comparisons, two-way designs, the general linear model, and profile analysis for repeated measurements.
## The One-Way MANOVA Question
Suppose each experimental unit produces a vector $X_{ij} \in \mathbb R^p$, where $i \in \{1,\dots,g\}$ labels the group and $j \in \{1,\dots,n_i\}$ labels the unit within the group. The first question is whether the groups differ in their mean vectors after accounting for the covariance among the $p$ responses. If the response coordinates are skull length, skull breadth, facial height, and nasal height, for instance, the question is not whether each coordinate differs separately, but whether archaeological sites have different multivariate mean morphology.
[definition: One-Way MANOVA Model]
Let $X_{ij}: (\Omega, \mathcal F) \to \mathbb R^p$ be random vectors, with $i \in \{1,\dots,g\}$ and $j \in \{1,\dots,n_i\}$. The one-way MANOVA model is
\begin{align*}
X_{ij} = \mu + \tau_i + \epsilon_{ij},
\end{align*}
where $\mu \in \mathbb R^p$, $\tau_i \in \mathbb R^p$, and the errors $\epsilon_{ij}$ are independent with $\epsilon_{ij} \sim \mathcal N(0,\Sigma)$ for a positive definite matrix $\Sigma \in \mathbb R^{p \times p}$.
[/definition]
The parameters $\tau_i$ represent group effects. To make the parametrisation identifiable, it is common to impose $\sum_{i=1}^g n_i\tau_i=0$, so that $\mu$ is the overall mean. The null hypothesis of no group effect is
\begin{align*}
H_0: \tau_1 = \cdots = \tau_g = 0.
\end{align*}
[definition: Sample Mean Vectors]
In the one-way MANOVA model, write
\begin{align*}
\bar X_i &= \frac{1}{n_i}\sum_{j=1}^{n_i} X_{ij}, &
\bar X &= \frac{1}{N}\sum_{i=1}^g\sum_{j=1}^{n_i} X_{ij}, &
N &= \sum_{i=1}^g n_i.
\end{align*}
[/definition]
The group mean vector $\bar X_i$ estimates $\mu+\tau_i$, while the pooled mean $\bar X$ estimates the overall mean. Under $H_0$, the differences $\bar X_i-\bar X$ should be small relative to the within-group scatter. The reason MANOVA is needed is that a small coordinatewise difference can become large after the coordinates are rotated into a scientifically meaningful direction.
[example: Skull Measurements Across Archaeological Sites]
Suppose skulls from $g=4$ archaeological sites are measured using $p=5$ cranial variables, and $X_{ij}$ records the measurement vector for skull $j$ from site $i$. A one-way MANOVA tests whether the sites have a common mean morphology. If every site differs by increasing skull length and decreasing skull breadth in a coordinated way, the separate univariate ANOVAs may split the effect into two moderate signals. MANOVA instead studies the contrast vector $\bar X_i-\bar X$ against the pooled within-site covariance, so it can detect a separation along the combined size-shape direction. This example shows the basic failure mode of coordinatewise testing: the signal may live in a rotated response direction rather than in a named measurement coordinate.
[/example]
## Decomposing Multivariate Variation
How should total variation be split into variation explained by group membership and variation left within groups? In univariate ANOVA the identity is a decomposition of sums of squares. MANOVA replaces each square by an outer product, producing sum-of-squares-and-products matrices.
[definition: Total Within And Between SSP Matrices]
For the one-way MANOVA model, define
\begin{align*}
T &= \sum_{i=1}^g\sum_{j=1}^{n_i} (X_{ij}-\bar X)(X_{ij}-\bar X)^\top,\\
W &= \sum_{i=1}^g\sum_{j=1}^{n_i} (X_{ij}-\bar X_i)(X_{ij}-\bar X_i)^\top,\\
B &= \sum_{i=1}^g n_i(\bar X_i-\bar X)(\bar X_i-\bar X)^\top.
\end{align*}
[/definition]
The matrix $T$ measures total scatter around the grand mean, $W$ measures scatter remaining after fitting separate group means, and $B$ measures scatter of group means around the grand mean. Each is symmetric and positive semidefinite.
[quotetheorem:4053]
[citeproof:4053]
This identity is the algebraic foundation of MANOVA. The centering by $\bar X_i$ is essential: without it the mixed terms would not cancel, and the decomposition would not separate fitted group variation from residual variation. The theorem does not say how to order matrices by size; it only reduces the problem to comparing the pair $(B,W)$. That unresolved comparison is what forces the move from sums of squares to characteristic roots.
[example: Two Response Coordinates]
For $p=2$, each matrix records variation in two coordinates and their covariance. Suppose the two coordinatewise between-group sums of squares are modest, but the off-diagonal entry of $B$ is large and positive because group means move together along the line $x_1\approx x_2$. Then the diagonal ANOVA summaries understate the separation, while the leading eigenvector of the matrix comparison points along the combined direction. This example teaches a reusable diagnostic: inspect covariance and eigendirections, not only the coordinatewise sums of squares.
[/example]
## Characteristic Roots And Canonical Correlations
The next problem is to reduce the matrix comparison between $B$ and $W$ to scalar test statistics without throwing away the covariance geometry. A single ratio such as $\operatorname{tr}(B)/\operatorname{tr}(W)$ can fail because it ignores whether the between-group variation lies in a low-noise or high-noise direction. The common solution is to examine the eigenvalues of the pair $(B,W)$, or equivalently the eigenvalues of $(B+W)^{-1}B$ when the relevant inverse exists.
[definition: MANOVA Characteristic Roots]
Assume $W$ is positive definite. The MANOVA characteristic roots are the nonzero eigenvalues $\theta_1 \ge \cdots \ge \theta_s > 0$ of $W^{-1}B$, where $s \le \min(p,g-1)$.
[/definition]
The roots $\theta_k$ are natural for matrix algebra, but their scale is not directly comparable to an ordinary correlation coefficient. To interpret each latent group-response direction on the familiar $[0,1]$ association scale, we need the following reparametrisation as squared canonical correlations.
[definition: Sample Canonical Correlations In MANOVA]
The sample canonical correlations $\hat\rho_1,\dots,\hat\rho_s$ associated with the MANOVA decomposition are defined by
\begin{align*}
\hat\rho_k^2 = \frac{\theta_k}{1+\theta_k}, \qquad k=1,\dots,s.
\end{align*}
[/definition]
These quantities measure the squared association between linear combinations of the response variables and linear contrasts of the group indicators. Thus MANOVA can be viewed as canonical correlation analysis between responses and group membership.
Wilks' statistic is defined through determinants, while the canonical-correlation view is expressed through roots and squared correlations. To compare these descriptions, the determinant ratio must be rewritten as a product over the same latent directions. The following identity provides that bridge.
[quotetheorem:4054]
[citeproof:4054]
The product formula gives the geometric meaning of Wilks' statistic: it is small when at least one response-group canonical correlation is large, and it is even smaller when several independent directions of separation are present. Positive definiteness of $W$ is needed so that residual scatter defines a genuine metric on response directions; if $W$ is singular, some linear combinations have zero residual variance and the ordinary inverse formulation breaks down. The theorem does not give a null distribution, only an algebraic identity linking determinants, roots, and canonical correlations. That identity is the bridge to the four classical scalar summaries.
[example: One Dominant Separation Direction]
Suppose three sites differ mostly along a single cranial size axis, with very little additional shape separation. Then $\hat\rho_1^2$ may be large while $\hat\rho_2^2$ is close to zero. Wilks' statistic is approximately $1-\hat\rho_1^2$, and Roy's largest-root statistic gives a similar ordering of evidence because the first root dominates.
[/example]
## The Four Classical MANOVA Statistics
Which scalar summary of the roots should be used for testing? Classical MANOVA offers four answers. Each statistic is a different way of aggregating the same latent roots, and each is most sensitive to a different pattern of alternatives.
[definition: Wilks Statistic]
For positive definite $W$ and $T=W+B$, Wilks' statistic is
\begin{align*}
\Lambda = \frac{|W|}{|T|}.
\end{align*}
[/definition]
Small values of $\Lambda$ indicate that the within-group determinant is small compared with the total determinant. This is the likelihood-ratio statistic for the multivariate normal one-way MANOVA model.
Wilks' statistic multiplies contributions from the roots, so a small number of strong directions can dominate the product. Another summary asks for an additive, bounded contribution from each canonical direction. Pillai's trace gives that more evenly controlled aggregation.
[definition: Pillai Trace]
For MANOVA matrices $B$ and $W$ such that $B+W$ is positive definite, Pillai's trace is
\begin{align*}
V = \operatorname{tr}\bigl(B(B+W)^{-1}\bigr).
\end{align*}
[/definition]
Pillai's trace sums the squared canonical correlations. It is often stable under moderate departures from equality of covariance matrices because each contribution is bounded by $1$.
Bounded canonical-correlation contributions are not always the desired emphasis. When the analysis should give larger separation roots proportionally more influence, the statistic must work on the root scale rather than on the bounded correlation scale. This motivates a trace that adds the eigenvalues of $W^{-1}B$ themselves.
[definition: Lawley Hotelling Trace]
For positive definite $W$, the Lawley-Hotelling trace is
\begin{align*}
U = \operatorname{tr}(W^{-1}B).
\end{align*}
[/definition]
This statistic sums the roots $\theta_k$, so large roots receive relatively strong weight. It is powerful when several roots are moderately large.
Sometimes the scientific alternative is expected to appear mainly along one response direction rather than spread across several directions. In that setting, summing all roots can dilute the most informative separation, so the test statistic should ignore weaker directions and retain only the strongest canonical separation. This leads to the largest-root statistic.
[definition: Roy Largest Root]
For positive definite $W$, Roy's largest-root statistic is
\begin{align*}
\theta = \theta_1 = \frac{\hat\rho_1^2}{1-\hat\rho_1^2},
\end{align*}
where $\theta_1$ is the largest eigenvalue of $W^{-1}B$.
[/definition]
Roy's statistic focuses on the best separating linear combination of the response variables. It can be very sensitive to a rank-one alternative, but it may miss alternatives spread across several smaller roots.
The four statistics have been defined in different algebraic forms, so it remains to put them on a common footing. The needed comparison is through the same generalized eigenvalues of the between- and within-group SSP matrices; once those roots are identified, each statistic is just a different function of them.
[quotetheorem:4055]
[citeproof:4055]
The hypotheses ensure that the matrix comparison can be reduced to ordinary real roots after whitening by $W$. If $W$ is singular, the same ideas require a generalized inverse or a restriction to the estimable response subspace, and the displayed formulas need additional qualification. These formulas show that the four tests differ only after the common root extraction step, so the statistical choice is about how to weight directions of separation.
[example: Comparing Alternatives]
Consider two alternatives with the same total apparent separation. In the first, the roots are approximately $(4,0,0)$; in the second, they are approximately $(1.4,1.3,1.3)$. A statistic based only on $\operatorname{tr}(B)$ would not reliably distinguish whether the effect is concentrated in one discriminating direction or spread across several directions after residual scaling. Roy's statistic favours the first because it only sees the largest root, while Lawley-Hotelling is competitive for the second because it adds all roots. Pillai and Wilks also use all directions, with Pillai damping very large individual roots through the transformation $\theta_k/(1+\theta_k)$.
[/example]
## Null Distributions And Approximations
After choosing a statistic, the testing problem becomes distributional: how large is the observed separation under $H_0$? Exact null distributions are available in special cases, but general MANOVA practice often uses transformations to $F$ or chi-squared reference distributions.
[quotetheorem:4056]
[citeproof:4056]
The common-covariance normal model is essential for the likelihood calculation: it gives the residual SSP matrices their Wishart structure and makes determinants the relevant likelihood summaries. The theorem does not assert that Wilks' statistic is uniformly most powerful, nor does it yet provide a critical value. The exact distribution of $\Lambda$ depends on $p$, $g-1$, and $N-g$.
For large samples, one route is to keep Wilks' likelihood-ratio interpretation and approximate its logarithm by a chi-squared variable. This is useful when the goal is a simple dimension-counting calibration rather than an exact finite-sample transformation. Bartlett's correction adjusts the raw log determinant ratio before applying the asymptotic reference law: in the usual common-covariance MANOVA model, the corrected statistic
\begin{align*}
-\left(N-1-\frac{p+g}{2}\right)\log\Lambda
\end{align*}
is compared asymptotically with a chi-squared distribution with $p(g-1)$ degrees of freedom.
The Bartlett correction keeps the likelihood-ratio logic but replaces the exact finite-sample distribution by an asymptotic reference law. The factor multiplying $\log\Lambda$ matters because uncorrected Wilks theory can be noticeably liberal when the residual degrees of freedom are not large compared with $p$ and $g$. This approximation should therefore be read as a large-sample calibration, not as an exact MANOVA identity.
A second practical calibration keeps an $F$ reference distribution, which is the format used in many MANOVA tables. This is helpful because it parallels univariate ANOVA output while still accounting for the response dimension and the number of group contrasts.
[remark: Rao's F Approximation]
Rao's approximation is a reporting method for Wilks' statistic, not an exact finite-sample distribution theorem. It converts $\Lambda$ into an approximate $F$ statistic using dimension-dependent quantities determined by the response dimension, the number of group contrasts, and the residual degrees of freedom. The resulting reference law is useful for many moderate-dimensional MANOVA tables, but it should be treated as an approximation whose validity depends on the usual denominator degrees of freedom being positive and on avoiding exceptional low-dimensional cases where exact formulas are available.
[/remark]
This is why software usually branches between exact low-dimensional formulas and the generic Rao approximation.
[remark: Low-Dimensional Exact Cases]
When $p=1$, MANOVA reduces to ordinary one-way ANOVA and Wilks' test is equivalent to the usual $F$ test. When $q=1$, the problem reduces to Hotelling's two-sample or one-factor multivariate comparison, and exact $F$ transformations are available under the multivariate normal model.
[/remark]
These exact cases are not merely conveniences; they mark situations where the multivariate root problem collapses to a single familiar direction. In those settings the approximations above should be checked against the specialised formula rather than applied automatically.
[example: Interpreting A Wilks Test]
In a skull-measurement study with $p=5$, $g=4$, and $N=80$, suppose the computed value is $\Lambda=0.61$. The Bartlett statistic uses $p(g-1)=15$ degrees of freedom and transforms $\log\Lambda$ by the corrected sample-size factor. Rao's approximation instead converts $\Lambda$ to an $F$ statistic with numerator degrees of freedom $15$ and an adjusted denominator degrees of freedom. The reusable lesson is that the same observed determinant ratio must be calibrated against dimension and residual degrees of freedom: a value such as $0.61$ may be strong evidence in a low-dimensional design but less decisive when many response and group directions are being searched. Both calculations ask whether the observed reduction from $T$ to $W$ is too large to attribute to sampling variation under equal mean vectors.
[/example]
## Post-Hoc Analysis After A Global Rejection
A global MANOVA rejection answers the existence question, not the explanatory question. After rejecting equal mean vectors, we need to identify which response coordinates, contrasts, or linear combinations account for the effect while controlling the error introduced by searching after the global test.
[definition: Linear Contrast Of Mean Vectors]
Let $c=(c_1,\dots,c_g) \in \mathbb R^g$ satisfy $\sum_{i=1}^g c_i=0$. A linear contrast of group mean vectors is
\begin{align*}
\sum_{i=1}^g c_i\mu_i \in \mathbb R^p,
\end{align*}
where $\mu_i=\mu+\tau_i$.
[/definition]
A contrast can compare two groups, compare one group with the average of several others, or encode a planned ordering. Post-hoc MANOVA usually studies coordinate projections of such contrasts and selected linear combinations.
Once contrasts are chosen after a global rejection, the main difficulty is simultaneous inference: many coordinates and contrasts may be inspected, and separate one-at-a-time intervals no longer have the advertised joint coverage. The formal question is how to turn the contrast estimator and residual covariance estimate into intervals that cover all selected contrast coordinates at once, so the post-hoc interpretation keeps a controlled error statement.
[quotetheorem:4059]
[citeproof:4059]
Simultaneous intervals treat the response coordinates as a family to be protected all at once. Some studies instead come with an ordered scientific hierarchy of variables, where the question is whether later measurements add information after earlier measurements have already been accounted for. That ordered post-hoc strategy needs its own definition because the conditioning structure is part of the method.
[definition: Step-Down MANOVA Test]
A step-down MANOVA test is a sequence of univariate or conditional tests applied to response variables in a prespecified order, where the test for a later variable adjusts for variables already entered earlier in the sequence.
[/definition]
The ordering must be chosen before inspecting the effects if the resulting inference is to retain its advertised interpretation. Otherwise the step-down procedure becomes another data-dependent search and its nominal error probabilities no longer describe the analysis actually performed. This limitation is the same post-hoc danger seen in univariate multiple comparisons, but here it applies both to variables and to response directions.
[example: Post-Hoc Skull Measurements]
After a significant MANOVA for skull measurements, an archaeologist might examine planned contrasts comparing site 1 against the average of sites 2, 3, and 4 for each cranial coordinate. Simultaneous intervals reveal which coordinates contribute to the site separation. A step-down analysis might enter cranial length first, then cranial breadth, then facial and nasal measurements, asking whether each later variable adds group information beyond the earlier anatomical dimensions.
[/example]
## Robustness And Power Of The Four Tests
The practical question is not only which statistic has a named distribution, but which statistic behaves well when the model is imperfect or the alternative has a particular shape. The four statistics can disagree because they weight the MANOVA roots differently.
[quotetheorem:4060]
[citeproof:4060]
This theorem is qualitative because exact power depends on the unknown alternative, sample sizes, and covariance structure. Its value is diagnostic: before choosing a default statistic, ask whether the scientific alternative is expected to be one strong discriminant direction or many smaller directions. That diagnostic also explains why robustness matters, since model departures can change the apparent root pattern.
[remark: Robustness Under Covariance Departures]
MANOVA assumes common covariance matrices across groups. Pillai's trace is often preferred as a default statistic when moderate covariance inequality or non-normality is suspected, especially with balanced group sizes. Severe imbalance combined with unequal covariance matrices can distort all four classical tests.
[/remark]
The robustness warning is not a licence to ignore the model assumptions. It says that Pillai's bounded root transform can reduce sensitivity to some perturbations, but it does not repair a badly designed or severely heteroscedastic study.
[example: Diffuse Shape Difference]
In a morphology study, suppose each individual cranial measurement differs a little across sites, and the effects point in several nearly orthogonal shape directions. Roy's test may be less impressive because no single canonical correlation dominates. Pillai's trace and Lawley-Hotelling trace accumulate evidence across directions, so they better reflect the diffuse multivariate shift.
[/example]
## Two-Way MANOVA And The General Linear Model
The one-way model handles a single grouping factor. Many experiments have two factors, covariates, interactions, or repeated profiles, so the final problem is to express MANOVA in a form that accommodates general linear hypotheses.
[definition: Multivariate General Linear Model]
The multivariate general linear model is
\begin{align*}
Y = XB + E,
\end{align*}
where $Y \in \mathbb R^{N\times p}$ is the response matrix, $X \in \mathbb R^{N\times r}$ is a fixed design matrix, $B \in \mathbb R^{r\times p}$ is the coefficient matrix, and the rows of $E$ are independent with distribution $\mathcal N(0,\Sigma)$.
[/definition]
This model contains one-way MANOVA by taking $X$ to be the group indicator design matrix. It also contains two-way MANOVA by including columns for the two main effects and their interaction.
To test effects inside this model, we need notation that can select rows of the coefficient matrix and, when needed, selected response contrasts. A multivariate linear hypothesis records both choices in matrix form, so main effects, interactions, regression effects, and profile contrasts can be handled by the same formula.
[definition: Multivariate Linear Hypothesis]
In the multivariate general linear model with $B \in \mathbb R^{r\times p}$, let $C \in \mathbb R^{k\times r}$ and $M \in \mathbb R^{p\times \ell}$. A multivariate linear hypothesis has the form
\begin{align*}
H_0: CBM = 0,
\end{align*}
where $CBM \in \mathbb R^{k\times \ell}$.
[/definition]
The matrix $C$ selects group, factor, or regression effects. The matrix $M$ selects response contrasts, which is what makes profile analysis fit into the same framework.
After the hypothesis is written as $CBM=0$, the remaining question is how to build the between-hypothesis and residual SSP matrices whose roots determine the test statistics. Redundant contrasts or singular residual variation would distort those roots, so the construction has to make the rank assumptions explicit.
[quotetheorem:4061]
[citeproof:4061]
The rank assumptions ensure that the coefficient contrasts and response contrasts represent the intended hypothesis without redundant rows or collapsed response directions. The condition that $E_M$ is positive definite is the GLM analogue of requiring an invertible within-group SSP matrix in one-way MANOVA; without it, the usual roots must be replaced by a restricted or generalized eigenvalue problem. The theorem is important because it shows that two-way MANOVA, regression MANOVA, and repeated-measures profile analysis are not separate theories but different choices of $X$, $C$, and $M$.
[definition: Profile Analysis]
Profile analysis is a repeated-measures MANOVA method in which each row of $Y$ records several measurements on the same unit, and hypotheses are expressed as contrasts among the response coordinates.
[/definition]
In profile analysis, a flat profile asks whether the response mean is constant across measurement occasions, parallel profiles ask whether groups have the same pattern across occasions, and coincident profiles ask whether groups have the same overall profile.
[example: Repeated Measures Profile Analysis]
Suppose patients in two treatments are measured at four visits, so $Y$ has four response columns. Parallelism is tested by choosing $M$ to contain successive-difference contrasts among visits and $C$ to compare the two treatment groups. A significant result means the treatment groups change differently over time, not merely that one group has a larger average response.
[/example]
The previous example fixes one test case for the idea. The next example, Two-Way MANOVA, changes the setting so the same mechanism can be recognized from another angle.
[example: Two-Way MANOVA]
In a study of plant growth, the response vector contains root length, shoot length, and leaf area. The factors are fertiliser type and watering regime. A two-way MANOVA builds $X$ with columns for fertiliser, watering, and their interaction, then tests the interaction SSP matrix against the residual SSP matrix. The structural point is that an interaction may be invisible in every single response coordinate but visible in a contrast such as increased root length paired with reduced leaf area under one watering regime. A significant interaction therefore indicates that the multivariate effect of fertiliser depends on the watering regime, and the fitted root direction tells the analyst which combined growth profile carries that dependence.
[/example]
## Summary Of The MANOVA Workflow
The workflow begins by choosing a multivariate model whose response vector and grouping structure match the scientific design. The data are reduced to an error SSP matrix and a hypothesis SSP matrix. The eigenvalues of this pair describe the dimensions along which group structure explains response variation beyond residual scatter.
Wilks' statistic is tied to the likelihood-ratio principle and equals $\prod_k(1-\hat\rho_k^2)$. Pillai's trace sums the squared canonical correlations, Lawley-Hotelling sums the roots, and Roy uses the largest root. After a global test, simultaneous intervals, planned contrasts, and step-down analyses explain which variables or response directions produced the multivariate rejection. The same construction carries from one-way MANOVA to two-way designs, the multivariate general linear model, and profile analysis.
MANOVA treats multiple responses jointly, but many scientific studies are better described by a full matrix model with several related equations. The multivariate linear model and generalized least squares extend the regression framework so that cross-response covariance can be modeled rather than ignored.
# 11. The Multivariate Linear Model and Generalised Least Squares
Building on the MANOVA framework of Chapter 10, this chapter extends the univariate linear model to the setting where each experimental unit carries a vector response. The central object is a rectangular response matrix $Y$, whose rows are observations and whose columns are response variables measured on the same units. The main point is that regression, covariance estimation, MANOVA, and prediction can all be expressed by the same projections used earlier, with the residual covariance matrix carrying the dependence between responses.
## Why a Matrix Response Changes the Linear Model
A univariate regression treats each response coordinate separately, but multivariate data often record several outcomes on the same individual. The question is how much of the ordinary least squares theory survives when the errors in different response coordinates are correlated. The answer is that the coefficient estimator has the same projection form under common regressors, while its uncertainty and the natural tests depend on the response covariance matrix.
[definition: Multivariate Linear Model]
Let $Y \in \mathbb R^{n \times p}$ be the response matrix, let $X \in \mathbb R^{n \times q}$ have rank $q$, let $B \in \mathbb R^{q \times p}$, and let $E \in \mathbb R^{n \times p}$. The multivariate linear model is
\begin{align*}
Y = XB + E,
\end{align*}
with
\begin{align*}
E \sim MN_{n \times p}(0, I_n, \Sigma),
\end{align*}
where $\Sigma \in \mathbb R^{p \times p}$ is positive definite.
[/definition]
The notation $MN_{n \times p}(M,R,\Sigma)$ denotes the matrix normal distribution with mean matrix $M$, row covariance $R$, and column covariance $\Sigma$; equivalently, under column-stacking, $\operatorname{vec}(E)$ has covariance $\Sigma\otimes R$. Thus $MN_{n \times p}(0,I_n,\Sigma)$ means that the rows of $E$ are independent $N_p(0,\Sigma)$ random vectors: observations are independent across units, but the $p$ response coordinates within the same unit may be correlated.
[example: Two Responses With Common Regressors]
Suppose $Y_{i1}$ is systolic blood pressure and $Y_{i2}$ is cholesterol for subject $i$, and $X$ contains an intercept, age, and treatment status. The coefficient matrix $B$ has one column for blood pressure regression coefficients and one column for cholesterol regression coefficients. If the two physiological outcomes are positively correlated after adjusting for age and treatment, then $\Sigma_{12}>0$, but the same design matrix $X$ is used in both response equations.
[/example]
The first estimation question is whether the shared response covariance forces a different coefficient formula from ordinary least squares. Because every response column uses the same design matrix, the residual cross-products couple the equations only through the covariance estimate, not through the normal equations for $B$. The theorem below identifies the coefficient estimator in this common-design setting.
[quotetheorem:4062]
[citeproof:4062]
This is the first structural simplification of the chapter: correlated response errors change the covariance of the estimator, not its algebraic form. The fitted values are $\hat Y=HY$, where $H=X(X^\top X)^{-1}X^\top$ is the projection matrix onto $\operatorname{col}(X)$. Thus the geometry of fitting is still projection onto the column space of the regressors, applied simultaneously to all response coordinates. The covariance matrix $\Sigma$ will reappear when measuring uncertainty in $\hat B$ and in the residual cross-product matrix, but it does not change the fitted mean surface itself.
## How the Least Squares Estimator Varies From Sample to Sample
Once the estimator has the same form as in ordinary least squares, the next question is its sampling distribution. The coefficient matrix is random because it is a linear transformation of $Y$, and the residual covariance estimator is random because it is a quadratic form in the residuals.
[quotetheorem:4063]
[citeproof:4063]
The distribution statement separates two kinds of randomness. The coefficient estimator varies through linear combinations of the response errors, while the residual cross-product varies through the part of the data left after projection away from $\operatorname{col}(X)$. This separation is what makes the multivariate linear model parallel the one-response normal linear model, with matrices replacing scalars.
Inference for the coefficient matrix also requires an estimate of the common response covariance $\Sigma$. The residual matrix
\begin{align*}
\hat E = Y-X\hat B = (I_n-H)Y
\end{align*}
contains exactly the variation left after fitting the design, and its available degrees of freedom are $n-q$. The following definition fixes both the likelihood scaling and the unbiased scaling used later in tests and prediction regions.
[definition: Residual Covariance Estimator]
In the multivariate linear model, define
\begin{align*}
\hat\Sigma = \frac{1}{n}(Y-X\hat B)^\top(Y-X\hat B)
\end{align*}
for the maximum likelihood estimator of $\Sigma$, and define
\begin{align*}
S = \frac{1}{n-q}(Y-X\hat B)^\top(Y-X\hat B)
\end{align*}
for the unbiased residual covariance estimator.
[/definition]
The factor $n$ gives the likelihood maximiser, while $n-q$ corrects the bias caused by estimating $q$ regression coefficients per response. To justify the correction, we need the residual cross-product to have the same Wishart structure as a covariance matrix built from $n-q$ independent residual directions. Under the normal multivariate linear model with full-rank design, the residual sum of squares $(Y-X\hat B)^\top(Y-X\hat B)$ has distribution $W_p(n-q,\Sigma)$ in the notation of this chapter and is independent of $\hat B$. This is the regression analogue of the sample covariance theorem: fitted directions consume degrees of freedom, and the remaining orthogonal residual directions carry the covariance information.
[example: Estimating A Two-Dimensional Residual Covariance]
In the blood pressure and cholesterol example with $q=3$ regressors and $n$ subjects, the residual cross-product is a $2\times2$ matrix. Its diagonal entries measure unexplained variation in each outcome, while the off-diagonal entry measures residual association between outcomes after removing the fitted age and treatment effects. Dividing by $n-3$ gives an unbiased estimate of $\Sigma$.
[/example]
## How Linear Hypotheses Become MANOVA Problems
In multivariate regression, a hypothesis may constrain both the rows and the columns of $B$. The natural question is how to test a statement such as several treatment contrasts having no effect on a chosen set of response combinations. This is handled by the matrix hypothesis $CBM=0$, which reduces the problem to a MANOVA comparison of hypothesis and error sums of squares.
[definition: General Linear Hypothesis In The Multivariate Linear Model]
Let $C \in \mathbb R^{r \times q}$ have rank $r$, and let $M \in \mathbb R^{p \times s}$ have rank $s$. A general linear hypothesis in the multivariate linear model is
\begin{align*}
H_0: CBM = 0.
\end{align*}
[/definition]
The matrix $C$ selects or contrasts regression coefficients, while $M$ selects or contrasts response variables. Taking $M=I_p$ tests a regression effect across all responses; taking $C$ to select treatment coefficients and $M$ to compare time points gives repeated-measures hypotheses.
The hypothesis can be reduced to the same hypothesis-error matrix pair used in MANOVA. Define
\begin{align*}
H_M &= M^\top \hat B^\top C^\top\{C(X^\top X)^{-1}C^\top\}^{-1}C\hat B M,\\
E_M &= M^\top(Y-X\hat B)^\top(Y-X\hat B)M,
\end{align*}
and set $\Omega_M=M^\top\Sigma M$. Under $H_0$, the normal model gives independent Wishart matrices
\begin{align*}
H_M\sim W_s(r,\Omega_M),\qquad E_M\sim W_s(n-q,\Omega_M),
\end{align*}
in the degrees-first convention used throughout this chapter, provided the stated rank conditions hold. The standard test statistics are then the Chapter 10 root summaries, now applied to the eigenvalues of $E_M^{-1}H_M$ when $E_M$ is invertible: Wilks lambda, Pillai trace, Lawley-Hotelling trace, and Roy largest root. The chapter uses this reduction to avoid rederiving separate tests for every multivariate regression hypothesis.
[example: Growth Curve Model For Repeated Measurements]
Suppose $Y_{ij}$ is the measurement on subject $i$ at time $t_j$, so each row of $Y$ is a longitudinal profile. A growth curve model may write $B=\Gamma A^\top$, where the columns of $A$ are known time basis vectors such as $1$, $t$, and $t^2$, and $\Gamma$ contains regression effects on those growth coefficients. Testing whether treatment changes the linear trend is a hypothesis of the form $CBM=0$, where $C$ selects the treatment row in $\Gamma$ and $M$ extracts the chosen time contrast.
[/example]
## Why Generalised Least Squares Can Improve Seemingly Unrelated Regressions
The previous sections used a common regressor matrix $X$ for every response. In many applications the responses are different equations with different covariates, but their errors are correlated because the equations are measured on the same units. The question is whether estimating the equations jointly improves efficiency.
[definition: Seemingly Unrelated Regression System]
A seemingly unrelated regression system has equations
\begin{align*}
y_j = X_j\beta_j + \varepsilon_j, \qquad j=1,\dots,p,
\end{align*}
where $y_j \in \mathbb R^n$, $X_j \in \mathbb R^{n \times q_j}$, and the row vectors $(\varepsilon_{i1},\dots,\varepsilon_{ip})$ are independent with covariance matrix $\Sigma$.
[/definition]
Stacking the equations gives a single linear model with block-diagonal design matrix and non-spherical covariance. If $\Omega=\Sigma\otimes I_n$ is the covariance matrix of the stacked error vector, the generalised least squares estimator is
\begin{align*}
\hat\beta_{\mathrm{GLS}}=(Z^\top\Omega^{-1}Z)^{-1}Z^\top\Omega^{-1}y.
\end{align*}
The reason for introducing GLS here is to determine when cross-equation covariance changes the fitted coefficients rather than only their reported uncertainty. In a seemingly unrelated system the answer depends on whether the equations share the same regressors. The next issue is therefore a structural one: identify the exact condition under which GLS collapses to separate least squares, and when it genuinely pools information across equations.
[quotetheorem:4066]
[citeproof:4066]
The point of this comparison is not that cross-equation covariance can be ignored. It says that when the regressors differ by equation, the covariance matrix determines how information is pooled across equations, so GLS can change the coefficient estimates as well as their standard errors. This is the setting in which seemingly unrelated regression differs from fitting each equation in isolation.
The multivariate linear model is the special case where every response equation has the same design matrix. In that case the GLS weighting has no room to reassign observations differently across equations, so the point estimate should collapse back to the ordinary least-squares matrix estimator. The following result records that equivalence and explains why the earlier estimator did not involve $\Sigma$.
[quotetheorem:4067]
[citeproof:4067]
This equivalence explains why $\hat B$ in the multivariate linear model does not depend on $\Sigma$. GLS is still needed for correct standard errors and tests, but common regressors remove the efficiency gain in point estimation.
[example: Reduced-Rank Regression As A Constrained Multivariate Linear Model]
In reduced-rank regression, the coefficient matrix is constrained by $\operatorname{rank}(B)\le k$ for some $k<\min(q,p)$. This says the fitted response vectors depend on the regressors through only $k$ latent linear combinations. The estimator is no longer the unrestricted $\hat B$; it is obtained by approximating the ordinary fitted values by a rank-$k$ coefficient structure, typically using an eigenvalue or singular value decomposition involving $Y^\top HY$ and the residual covariance estimate.
[/example]
## How To Predict A New Multivariate Observation
After estimating the model, the practical question is how to predict the response vector for a new row of regressors. Because the response is multivariate, uncertainty is described by an ellipsoid rather than an interval.
[definition: Predicted Mean At A New Design Point]
For $x_0 \in \mathbb R^q$, the predicted mean response in the multivariate linear model is
\begin{align*}
\hat\mu_0 = x_0^\top\hat B \in \mathbb R^p.
\end{align*}
[/definition]
The uncertainty in $\hat\mu_0$ depends on the leverage
\begin{align*}
h_0=x_0^\top(X^\top X)^{-1}x_0.
\end{align*}
For a future observation $Y_0=x_0^\top B+\varepsilon_0$, the additional random error contributes another copy of $\Sigma$.
Point prediction is not enough when the response is vector-valued: the uncertainty has both a design component and a covariance-shape component across responses.
The next inferential problem is to turn those two sources of uncertainty into a simultaneous region for the whole future response vector. Such a region must combine leverage, residual covariance, and the extra noise of a future observation, rather than assemble separate marginal intervals for each coordinate.
[quotetheorem:4068]
[citeproof:4068]
The region is an ellipsoid centred at $\hat\mu_0$. Its axes are determined by the eigenvectors of $S$, while its overall size grows with $1+h_0$, so predictions far from the centre of the observed design are less precise.
[example: Prediction Ellipse For Two Future Outcomes]
With two responses, the prediction region is an ellipse in the $(Y_1,Y_2)$-plane. If the residual covariance estimate has positive off-diagonal entry, the ellipse is tilted upward, indicating that high predicted errors in one coordinate tend to accompany high predicted errors in the other. A new design point with large leverage produces a larger ellipse even when the residual covariance matrix is unchanged.
[/example]
## What This Chapter Adds To The Course
The multivariate linear model is the regression version of the sampling theory developed earlier for the multivariate normal and Wishart distributions. Ordinary least squares supplies the fitted mean matrix, the Wishart distribution supplies covariance estimation, and MANOVA supplies tests for matrix-valued hypotheses. Generalised least squares then explains when cross-equation covariance improves efficiency and why common regressors are a special case where the coefficient estimate remains unchanged.
The multivariate linear model works well when dimension is moderate and classical covariance estimators behave stably, but that assumption breaks down in modern problems with many variables. High-dimensional asymptotics explains what changes when $p$ is not small relative to $n$, and it prepares us for more flexible tools beyond the classical regime.
# 12. High-Dimensional Asymptotics and Modern Extensions
High-dimensional multivariate statistics begins when the number of variables is no longer small compared with the number of observations. In the preceding chapters, estimators such as the sample covariance matrix and PCA were analysed in a regime where $p$ was fixed and $n$ grew. This chapter asks what changes when $p/n$ has a non-zero limit, and why classical intuition about eigenvalues, inversion, and dimension reduction can fail in that regime.
## The Curse of Dimensionality for Covariance Matrices
What goes wrong with the sample covariance matrix when the ambient dimension is comparable to, or larger than, the sample size? Let $X_1,\dots,X_n \in \mathbb R^p$ be observations with sample mean $\bar X = n^{-1}\sum_{i=1}^n X_i$ and sample covariance
\begin{align*}
S = \frac{1}{n-1}\sum_{i=1}^n (X_i-\bar X)(X_i-\bar X)^\top.
\end{align*}
Classical multivariate procedures often use $S^{-1}$, so singularity and poor conditioning become statistical problems, not just numerical ones.
[quotetheorem:4069]
[citeproof:4069]
This rank bound explains why high-dimensional covariance estimation cannot be treated as a minor perturbation of the fixed-$p$ theory. If $p\ge n$, any method requiring $S^{-1}$ must be regularised, projected, or replaced.
The point is not merely that numerical software may fail to invert a matrix. The data themselves contain fewer independent centred directions than the ambient space has coordinates, so some linear combinations of variables have zero observed variation no matter how the covariance entries are arranged. In applications this means that classical formulas based on Mahalanobis distance, Gaussian likelihoods, or minimum-variance weights are asking the sample covariance to supply information in directions that the sample has not observed. Regularisation methods such as ridge adjustments, shrinkage covariance estimates, factor models, or dimension reduction should therefore be viewed as modelling choices that add structure beyond the raw sample covariance, not as harmless computational patches.
Even when $p<n$ and the sample covariance is technically invertible, the same instability can appear in a less extreme form if its smallest eigenvalues are tiny compared with its largest eigenvalues. To decide whether inversion is merely possible or actually reliable, we need a scalar diagnostic that measures how uneven the spectrum is. That diagnostic is the condition number.
[definition: Condition Number]
Let $A\in \mathbb R^{p\times p}$ be symmetric positive definite with eigenvalues $0<\lambda_1\le \dots \le \lambda_p$. The condition number of $A$ is
\begin{align*}
\kappa(A)=\frac{\lambda_p}{\lambda_1}.
\end{align*}
[/definition]
A large condition number means that inversion amplifies small perturbations. In multivariate analysis this affects Mahalanobis distances, discriminant analysis, portfolio weights, and any likelihood calculation involving $\det S$ or $S^{-1}$.
[example: Singular Portfolio Covariance]
Suppose daily returns are recorded for $p=500$ assets over $n=250$ days. The centred return matrix has at most rank $249$, so the sample covariance matrix has at least $251$ zero eigenvalues. Minimum-variance portfolio weights proportional to $S^{-1}u$, where $u=(1,\dots,1)^\top$, are therefore undefined unless the covariance estimate is regularised. This example shows that the failure is structural: collecting fewer observations than assets cannot identify all covariance directions from the sample covariance alone.
[/example]
## Spectral Limits and the Marchenko-Pastur Law
If the true covariance is the identity, should the eigenvalues of $S$ cluster tightly around $1$? In fixed dimension the answer is yes as $n\to\infty$, but high-dimensional asymptotics produces a spread-out spectrum even under pure noise. The Marchenko-Pastur law describes this limiting empirical distribution of sample eigenvalues.
[definition: Empirical Spectral Distribution]
Let $A_p\in \mathbb R^{p\times p}$ be symmetric with eigenvalues $\lambda_1,\dots,\lambda_p$. The empirical spectral distribution of $A_p$ is the probability measure
\begin{align*}
F^{A_p} = \frac{1}{p}\sum_{j=1}^p \delta_{\lambda_j}.
\end{align*}
[/definition]
The empirical spectral distribution records the whole bulk of eigenvalues, not only the largest or smallest. Once the spectrum is viewed as a probability measure, the central question becomes whether this random measure has a deterministic high-dimensional limit under a pure-noise model. The Marchenko-Pastur law answers that question and gives the benchmark bulk against which empirical eigenvalues are judged.
[quotetheorem:4070]
[citeproof:4070]
The theorem says that noise eigenvalues fill an interval rather than collapsing to $1$. For example, when $p/n\to 1$, the limiting support is $[0,4]$, so very small and quite large sample eigenvalues arise even from spherical data.
[example: Noise Eigenvalue Spread]
Take $p=500$, $n=1000$, and $X_i\sim \mathcal N(0,I_p)$ independently. Here $c=1/2$, so the Marchenko-Pastur support is approximately
\begin{align*}
[(1-\sqrt{1/2})^2,(1+\sqrt{1/2})^2]\approx [0.086,2.914].
\end{align*}
Thus a sample principal component with eigenvalue near $2.5$ need not indicate a genuine factor. Its size is compatible with high-dimensional sampling noise.
[/example]
## Spikes and Detectable Principal Components
When does a population principal component rise above the random matrix noise bulk? The spiked covariance model gives a controlled setting in which a small number of signal eigenvalues are added to an otherwise spherical covariance. The main lesson is that weak spikes are not merely hard to estimate; below a threshold they merge with the noise edge.
[definition: Spiked Covariance Model]
For fixed $r\ge 1$, the spiked covariance model assumes observations $X_1,\dots,X_n\in\mathbb R^p$ are independent with mean $0$ and covariance matrix
\begin{align*}
\Sigma = \operatorname{diag}(\theta_1,\dots,\theta_r,1,\dots,1),
\end{align*}
where $\theta_1\ge \dots\ge \theta_r>1$ are the population spikes.
[/definition]
The model isolates the effect of a low-rank signal. The statistical question is not only whether a spike exists in the population covariance, but whether its sample eigenvalue separates from the noise bulk strongly enough to be detected by PCA. The separation threshold is the key fact needed for interpreting empirical principal components.
[quotetheorem:4071]
[citeproof:4071]
This is the Baik-Ben Arous-Peche transition as used in Johnstone's spiked covariance framework. The proof uses random matrix tools beyond the core course, so we use the statement as a guide to interpreting PCA in high dimension.
[example: Weak and Strong Spikes]
Let $p/n\to 1/4$, so the detection threshold is $1+\sqrt{1/4}=1.5$. A population eigenvalue $\theta=1.3$ is absorbed into the top of the noise bulk, while $\theta=2$ separates and produces a sample eigenvalue tending to $2(1+1/4)=2.5$. This illustrates why a visible empirical eigengap has to be judged relative to the noise edge, not relative to $1$.
[/example]
## Shrinkage and Regularised Covariance Estimation
How can we estimate covariance matrices when $S$ is singular or too unstable to invert? A common strategy is to shrink the noisy sample covariance toward a structured target, typically a multiple of the identity. This trades a small amount of bias for a large reduction in variance and conditioning problems.
[definition: Ledoit-Wolf Shrinkage Estimator]
Let $S$ be the sample covariance matrix and let
\begin{align*}
\hat\mu = \frac{1}{p}\operatorname{tr}(S).
\end{align*}
For $\alpha\in[0,1]$, the Ledoit-Wolf linear shrinkage estimator toward the identity is
\begin{align*}
\hat\Sigma_{LW}=\alpha\hat\mu I+(1-\alpha)S.
\end{align*}
[/definition]
The target $\hat\mu I$ preserves the average sample variance while removing unstable covariance directions. If $\alpha>0$ and $\hat\mu>0$, the estimator is positive definite even when $S$ is singular. The remaining modelling problem is how much shrinkage to use: too little leaves the covariance unstable, while too much erases real dependence. The Ledoit-Wolf result gives a data-driven calibration within this linear shrinkage family.
[quotetheorem:4072]
[citeproof:4072]
The theorem does not claim that the identity target is always the best structured model. Its message is that, once this shrinkage family has been chosen, the shrinkage level can be calibrated rather than tuned by guesswork.
[example: Regularised Portfolio Covariance]
Return again to $p=500$ assets and $n=250$ observations. The sample covariance matrix is singular, but $\hat\Sigma_{LW}$ has eigenvalues $\alpha\hat\mu+(1-\alpha)\lambda_j(S)$. Zero eigenvalues of $S$ are moved to $\alpha\hat\mu$, so the inverse exists when $\alpha>0$ and $\hat\mu>0$. In portfolio optimisation, this prevents the optimiser from exploiting null directions created by insufficient data.
[/example]
## Sparse Principal Components
What if a principal component is statistically present but scientifically unreadable because it uses nearly every variable? Sparse PCA modifies variance maximisation by asking for components with many zero loadings. This is especially important in genomics and other settings where interpretation requires identifying a small set of active features.
[definition: Sparse PCA Objective]
Given a sample covariance matrix $S\in\mathbb R^{p\times p}$, a sparse first principal component may be estimated by solving a penalised problem of the form
\begin{align*}
\max_{v\in\mathbb R^p}\; v^\top S v - \lambda \|v\|_1
\qquad \text{subject to } |v|=1,
\end{align*}
where $\lambda\ge 0$ controls sparsity.
[/definition]
The constraint keeps the loading vector on the Euclidean unit sphere, while the penalty discourages many small non-zero coordinates. The optimisation is non-convex, so algorithms differ in how they approximate or relax the problem.
[explanation: Sparse PCA via LASSO on the SVD]
One approach starts from the singular value decomposition of the centred data matrix and alternates between low-rank approximation and LASSO-type regression. The ordinary PCA loading is treated as a response direction, and an $\ell^1$ penalty is imposed on the regression coefficients to induce zeros. After thresholding and normalisation, the resulting loading vector is sparse but still aligned with a high-variance direction of the data. This viewpoint connects sparse PCA to familiar penalised regression methods while retaining the geometric interpretation of PCA.
[/explanation]
The explanation gives the organizing idea; the example now supplies a test case. This keeps the next construction tied to computations rather than only to terminology.
[example: PCA and Sparse PCA on Gene Data]
Suppose expression levels for $p=10{,}000$ genes are measured on $n=200$ samples from two biological conditions. Ordinary PCA may produce a leading loading vector with thousands of small non-zero entries, making it difficult to associate the component with a pathway. Sparse PCA with a tuned penalty can produce a component supported on a few dozen genes while still separating the two conditions in the score plot. The example illustrates the tradeoff: sparse PCA sacrifices some explained variance to gain interpretability and stability.
[/example]
## Largest Eigenvalues and Tracy-Widom Fluctuations
After the Marchenko-Pastur theorem identifies the bulk edge, what is the scale of random fluctuation of the largest noise eigenvalue around that edge? Classical central limit scaling is not the right answer. Random matrix theory predicts a different fluctuation law at the spectral edge.
[quotetheorem:4073]
[citeproof:4073]
This result is quoted as a preview rather than proved in this course. Its role is to explain why high-dimensional PCA tests often compare the largest observed eigenvalue to a random matrix edge distribution rather than to a normal approximation.
[remark: Interpreting the Edge]
The Marchenko-Pastur theorem gives the deterministic first-order location of the noise edge, while the Tracy-Widom law gives the second-order fluctuation scale near that edge. The BBP transition then describes what happens when a signal eigenvalue attempts to cross the same boundary. Together these results form the modern random matrix view of PCA.
[/remark]
High-dimensional asymptotics changes the default interpretation of multivariate methods. Singular covariance matrices, spread-out noise spectra, phase transitions for signal detection, shrinkage estimators, and sparse components are not specialised exceptions; they are the standard behaviour when $p$ is of the same order as $n$. These ideas frame modern multivariate workflows, where modelling choices are judged by stability, prediction, and interpretability as much as by classical likelihood theory.
High-dimensional theory shows that dependence structures can become unstable and hard to estimate when the number of variables is large. Copulas and related methods address this by separating marginal behavior from dependence itself, giving a framework that extends multivariate modeling beyond the normal case.
# 13. Copulas and Dependence Beyond Normality
This chapter builds on the multivariate normal model from Chapter 2, the covariance-based dependence methods from Chapters 6-8, and the rank and marginal distribution ideas needed to separate margins from dependence. The main question is how much of a joint model is marginal behaviour, and how much is dependence. The multivariate normal ties these two pieces together very tightly, since covariance controls both linear association and the full joint law. In many applications, especially finance, hydrology, and environmental risk, the marginal distributions are skewed or heavy-tailed while the dependence has its own structure. Copulas separate these roles by representing a joint distribution through its marginal distribution functions and a dependence object on the unit cube.
## Separating Margins from Dependence
If $X=(X_1,\dots,X_p)$ is a random vector, its joint distribution function $F$ contains both the marginal distributions $F_1,\dots,F_p$ and the dependence between coordinates. The guiding problem is whether we can keep the margins fixed while changing the dependence, or compare dependence across data sets with different units and marginal scales.
[definition: Copula]
A $p$-dimensional copula is a distribution function $C:[0,1]^p\to[0,1]$ whose one-dimensional marginal distribution functions are uniform on $[0,1]$.
[/definition]
Thus a copula is a multivariate distribution on the unit cube with standard uniform margins. It records the dependence structure after each coordinate has been transformed to its marginal probability scale.
[example: Independent Copula]
Let $U_1,\dots,U_p$ be independent random variables with $U_j\sim \operatorname{Unif}(0,1)$. Their joint distribution function is
\begin{align*}
C(u_1,\dots,u_p)=\prod_{j=1}^p u_j,
\end{align*}
for $0\le u_j\le 1$. This copula represents independence, since applying arbitrary continuous marginal inverse distribution functions to the $U_j$ changes the margins but keeps independence.
[/example]
The independent copula shows one extreme: after moving to probability scale, the coordinates have no remaining dependence. The central theorem says that this probability-scale viewpoint is not a special trick for independence, but a general representation of joint laws. It also explains exactly where uniqueness enters, which matters when some margins have atoms.
[quotetheorem:4074]
[citeproof:4074]
Sklar's theorem shows that modelling can be done in two stages: choose or estimate marginal distributions, then choose a copula for dependence. The continuity hypothesis is the reason the copula is unique: if each $F_j$ has no jumps, then its range fills $[0,1]$ densely enough to determine $C$ everywhere by right-continuity. With discrete or mixed margins, the formula only determines $C$ on the product of the marginal ranges, so different copulas can agree on all observable joint probabilities. This non-uniqueness is not a defect in the representation, but it is a limitation when interpreting a fitted copula as the dependence structure rather than as one dependence structure compatible with the margins. The separation is especially useful when normal margins are poor but rank-based dependence is stable.
[example: Constructing a Joint Distribution from a Copula]
Let $C$ be the independent copula and let $F_1$ be an exponential distribution function while $F_2$ is a Student $t$ distribution function. Then
\begin{align*}
F(x_1,x_2)=C(F_1(x_1),F_2(x_2))=F_1(x_1)F_2(x_2)
\end{align*}
defines a bivariate distribution with exponential first margin, Student $t$ second margin, and independence. Replacing $C$ while keeping $F_1$ and $F_2$ fixed changes dependence without changing the two marginal distributions.
[/example]
## Parametric Copula Families
The modelling question is how to choose a dependence family rich enough to capture observed association but simple enough to estimate. The main families in this course are elliptical copulas, inherited from multivariate normal and Student $t$ distributions, and Archimedean copulas, built from one-dimensional generator functions.
[definition: Gaussian Copula]
Let $R\in\mathbb R^{p\times p}$ be a correlation matrix, and let $\Phi_R$ be the distribution function of $N_p(0,R)$. The Gaussian copula with correlation matrix $R$ is the map $C_R:[0,1]^p\to[0,1]$ defined by
\begin{align*}
C_R(u_1,\dots,u_p)=\Phi_R(\Phi^{-1}(u_1),\dots,\Phi^{-1}(u_p)),
\end{align*}
where $\Phi$ is the standard normal distribution function.
[/definition]
The Gaussian copula preserves the dependence geometry of the multivariate normal after removing the normal margins. It is symmetric and convenient, but its tail behaviour is often too weak for joint extremes.
To model variables that can move together in the tails while keeping an elliptical dependence structure, we need a copula built from a distribution with heavier joint tails. The Student $t$ construction keeps the correlation-matrix parameter but replaces the normal quantiles by Student $t$ quantiles.
[definition: Student t Copula]
Let $R\in\mathbb R^{p\times p}$ be a correlation matrix and let $\nu>0$. If $t_{\nu,R}$ is the distribution function of the centred $p$-variate Student $t$ distribution with correlation matrix $R$, and $t_\nu$ is the univariate Student $t$ distribution function, then the Student $t$ copula is the map $C_{\nu,R}:[0,1]^p\to[0,1]$ defined by
\begin{align*}
C_{\nu,R}(u_1,\dots,u_p)=t_{\nu,R}(t_\nu^{-1}(u_1),\dots,t_\nu^{-1}(u_p)).
\end{align*}
[/definition]
Unlike the Gaussian copula, the Student $t$ copula can put substantial probability in joint tails. The parameter $\nu$ controls tail thickness, while $R$ controls the central elliptical dependence pattern.
Elliptical copulas are still restrictive because dependence is organized through a correlation matrix. For low-dimensional models where one wants a generator-driven family with flexible asymmetric tail behaviour, the formal object to introduce is a copula generated from a one-dimensional function rather than from an elliptical distribution.
This raises a different modelling problem: how can a single scalar mechanism combine several marginal ranks into a valid joint distribution while keeping the construction easy to simulate and interpret? Archimedean copulas answer this by replacing the correlation-matrix parameter with a generator whose inverse transforms marginal probabilities onto an additive scale.
[definition: Archimedean Copula]
Let $\psi:[0,\infty)\to[0,1]$ be a continuous decreasing completely monotone generator satisfying $\psi(0)=1$ and $\lim_{t\to\infty}\psi(t)=0$, with generalized inverse $\psi^{-1}$. The Archimedean copula generated by $\psi$ is the map $C:[0,1]^p\to[0,1]$ defined by
\begin{align*}
C(u_1,\dots,u_p)=\psi(\psi^{-1}(u_1)+\cdots+\psi^{-1}(u_p)).
\end{align*}
[/definition]
The Archimedean form reduces multivariate dependence to a single generator, so it is attractive for low-dimensional modelling and simulation. Different generators emphasize different forms of asymmetric tail association.
[example: Clayton Copula]
For $\theta>0$, the Clayton generator is $\psi(t)=(1+\theta t)^{-1/\theta}$. The bivariate Clayton copula is
\begin{align*}
C_\theta(u,v)=(u^{-\theta}+v^{-\theta}-1)^{-1/\theta}.
\end{align*}
It has stronger lower-tail association than upper-tail association, so it is useful when small values tend to occur together, such as low rainfall or simultaneous credit deterioration.
[/example]
Clayton is the standard example where dependence is not symmetric between the two tails. The next two families show that the generator can instead emphasize upper-tail clustering or avoid tail dependence while still producing nonzero rank association.
[example: Gumbel and Frank Copulas]
The Gumbel copula, with parameter $\theta\ge 1$, is generated by $\psi(t)=\exp(-t^{1/\theta})$ and has upper-tail dependence. The Frank copula, with $\theta\ne0$, has generator
\begin{align*}
\psi(t)=-\frac{1}{\theta}\log(1-(1-e^{-\theta})e^{-t}),
\end{align*}
and gives a symmetric dependence family without tail dependence. These examples show that similar values of a rank correlation can correspond to different extremal behaviour.
[/example]
Archimedean copulas are also useful because several of them have direct simulation schemes. The Clayton case illustrates the common idea: introduce a shared latent variable, then make the coordinates conditionally independent given that variable.
[example: Clayton Copula Simulation]
In the bivariate Clayton model with parameter $\theta>0$, a simulation can be produced through the frailty representation. Let $S\sim \operatorname{Gamma}(1/\theta,1)$ and let $E_1,E_2$ be independent $\operatorname{Exp}(1)$ random variables independent of $S$. Then
\begin{align*}
U_j=(1+E_j/S)^{-1/\theta},\qquad j=1,2,
\end{align*}
has Clayton copula $C_\theta$. The construction makes the shared positive random variable $S$ the source of dependence, and small values of $S$ generate simultaneous lower-tail events.
[/example]
## Rank-Based Measures of Dependence
Pearson correlation is natural for elliptical models, but it can be misleading when margins are nonlinear, skewed, or heavy-tailed. Copulas motivate dependence measures based on ranks, because ranks depend on the marginal ordering rather than the measurement scale.
[definition: Kendall Tau]
Let $(X,Y)$ and $(X',Y')$ be independent copies of a continuous bivariate random vector. Kendall's $\tau$ is
\begin{align*}
\tau=\mathbb P((X-X')(Y-Y')>0)-\mathbb P((X-X')(Y-Y')<0).
\end{align*}
[/definition]
Kendall's $\tau$ measures the difference between concordance and discordance probabilities. Since monotone transformations preserve the ordering of each margin, $\tau$ depends only on the copula.
Pairwise concordance is not the only rank-based way to summarize dependence. When we want a correlation-like measure on the probability scale itself, the relevant objects are the transformed coordinates $F_X(X)$ and $F_Y(Y)$, and this leads to Spearman's $\rho$.
[definition: Spearman Rho]
Let $(X,Y)$ have continuous marginal distribution functions $F_X$ and $F_Y$. Spearman's $\rho$ is the Pearson correlation of the transformed variables $F_X(X)$ and $F_Y(Y)$:
\begin{align*}
\rho_S=\operatorname{Corr}(F_X(X),F_Y(Y)).
\end{align*}
[/definition]
Spearman's $\rho$ is also copula-based, since it is the correlation of the two probability-scale coordinates. In data analysis, both $\tau$ and $\rho_S$ are estimated from ranks and are robust to monotone marginal transformations. To compare these rank summaries with the ordinary correlation parameter used in Gaussian models, we need the special calibration formulas that hold for the bivariate normal distribution.
[quotetheorem:4075]
[citeproof:4075]
These formulas provide a calibration between rank dependence and Pearson correlation in the normal model. The bivariate normal hypothesis is essential: the same Pearson correlation does not determine Kendall's $\tau$ or Spearman's $\rho_S$ for a general joint distribution. For example, a Gaussian copula and a Clayton copula can be chosen to have the same Kendall $\tau$, but their Spearman correlations and tail probabilities need not match. Away from the Gaussian copula, these formulas are best interpreted as model-specific conversions, not as universal identities between rank dependence and linear correlation.
[example: Same Kendall Tau, Different Tail Behaviour]
A Gaussian copula and a Clayton copula can be tuned to have the same Kendall $\tau$. The two models then agree on average concordance, but the Clayton copula gives more simultaneous lower-tail observations. This distinction matters when the target quantity is a joint exceedance probability rather than a central association summary.
[/example]
## Tail Dependence and Joint Extremes
Many multivariate risks are driven by the probability that variables are extreme together. A model can have strong central correlation and still underestimate joint extremes, so we need coefficients that isolate tail co-movement.
[definition: Tail Dependence Coefficients]
Let $(X,Y)$ have continuous marginal distribution functions and copula $C$. The lower and upper tail dependence coefficients are
\begin{align*}
\lambda_L&=\lim_{u\downarrow0}\mathbb P(F_Y(Y)\le u\mid F_X(X)\le u),\\
\lambda_U&=\lim_{u\uparrow1}\mathbb P(F_Y(Y)>u\mid F_X(X)>u),
\end{align*}
provided the limits exist.
[/definition]
These coefficients are copula quantities. In terms of $C$, the lower coefficient is $\lambda_L=\lim_{u\downarrow0}C(u,u)/u$, and the upper coefficient is $\lambda_U=\lim_{u\uparrow1}(1-2u+C(u,u))/(1-u)$ when the limits exist.
The first model check is whether the Gaussian copula, despite allowing nonzero central correlation, can produce positive asymptotic tail co-movement. This question determines whether Gaussian dependence is adequate for applications dominated by paired extremes, so it needs a precise theorem rather than an informal visual intuition about correlation.
[quotetheorem:4076]
[citeproof:4076]
The conclusion does not say that Gaussian copula variables cannot be jointly large. It says that, conditional on one variable being more and more extreme, the probability that the other is comparably extreme tends to zero. The restriction $r\in(-1,1)$ excludes the degenerate boundary cases: at $r=1$ the two probability-scale variables are identical and both tail dependence coefficients are $1$, while at $r=-1$ simultaneous upper-tail or lower-tail extremes cannot occur. This is the concrete modelling warning: a Gaussian copula with any non-perfect correlation can match central association while still assigning asymptotically negligible probability to paired extremes, motivating the Student $t$ and upper-tail Archimedean alternatives below.
[quotetheorem:4077]
[citeproof:4077]
The formula shows that positive tail dependence is not created merely by having heavy-tailed margins; it comes from the common-scale dependence in the Student $t$ copula. The hypotheses again exclude $r=1$, where the variables are comonotone on probability scale, and $r=-1$, where the centred elliptical construction is perfectly negatively dependent. For fixed $r<1$, the coefficient decreases as $\nu$ increases, matching the limiting intuition that the Student $t$ copula approaches the Gaussian copula and loses tail dependence. This theorem is therefore a model-selection tool: if joint extremes are central to the application, estimating $\nu$ is as important as estimating the correlation matrix.
[example: Joint Extremes of Rainfall]
Suppose daily rainfall totals are recorded at two nearby stations and each marginal distribution is fitted by a heavy-tailed gamma or generalized Pareto-type model for positive rainfall. A Gaussian copula can match central rank association but gives $\lambda_U=0$, so it may understate the probability of simultaneous extreme rainfall. A Student $t$ or Gumbel copula can encode positive upper-tail dependence and therefore gives larger estimates for joint flood-risk events.
[/example]
## Estimating Copula Models
In applications the margins and the copula are both unknown. The statistical problem is to estimate them without confusing marginal misspecification with dependence misspecification.
[definition: Inference Functions for Margins]
Inference functions for margins is a two-stage estimation method for a parametric copula model. First, estimate the marginal parameters separately for each coordinate. Second, plug the fitted marginal distribution functions into the copula likelihood and estimate the copula parameters.
[/definition]
The IFM method is computationally convenient because it replaces one large maximum likelihood problem by several smaller marginal problems and one dependence problem. It relies on the margins being specified well enough that the second stage sees approximately uniform probability-scale data.
[definition: Canonical Maximum Likelihood]
Canonical maximum likelihood is a semiparametric copula estimation method in which the marginal distribution functions are replaced by empirical distribution functions or scaled ranks, and the copula parameter is estimated by maximizing the resulting pseudo-likelihood.
[/definition]
CML focuses on dependence and avoids committing to parametric margins. It is well suited when rank information is reliable but marginal tail forms are uncertain.
[example: Pseudo-Observations]
Given observations $X_1,\dots,X_n$ with $X_i=(X_{i1},\dots,X_{ip})$, define scaled ranks
\begin{align*}
\hat U_{ij}=\frac{R_{ij}}{n+1},
\end{align*}
where $R_{ij}$ is the rank of $X_{ij}$ among $X_{1j},\dots,X_{nj}$. A copula density $c_\theta$ can then be fitted by maximizing
\begin{align*}
\sum_{i=1}^n \log c_\theta(\hat U_{i1},\dots,\hat U_{ip}).
\end{align*}
This procedure estimates dependence from the relative positions of observations within each margin.
[/example]
## Finance and the Gaussian Copula Critique
Credit portfolios require probabilities of many defaults occurring together. The Gaussian copula became influential because it turned default correlation into a tractable pricing input, but the same tractability also hid weaknesses in tail modelling and parameter stability.
[definition: One-Factor Gaussian Copula Default Model]
In the one-factor Gaussian copula default model, obligor $j$ defaults by a fixed horizon if
\begin{align*}
Y_j=\sqrt{\rho_j}Z+\sqrt{1-\rho_j}\,\varepsilon_j\le a_j,
\end{align*}
where $Z,\varepsilon_1,\dots,\varepsilon_p$ are independent standard normal random variables, $0\le\rho_j<1$, and $a_j$ is chosen to match the marginal default probability of obligor $j$.
[/definition]
The common factor $Z$ creates dependence among defaults, while the thresholds $a_j$ encode marginal default probabilities. Conditional on $Z$, defaults are independent, which makes portfolio loss distributions computationally manageable.
[example: CDO Tranche Sensitivity]
In a collateralized debt obligation, senior tranche losses occur only when many underlying defaults accumulate. Under a one-factor Gaussian copula, changing the asset correlation parameter changes the probability mass in high-loss portfolio states. Two models can match observed single-name default probabilities but give different tranche prices because they assign different probabilities to clustered defaults.
[/example]
The critique is not that copulas are invalid. The issue is that a Gaussian copula with a small number of correlation parameters can underrepresent tail dependence, contagion, parameter uncertainty, and changes in dependence during stress. Copula modelling is useful when it is treated as a transparent separation of margins and dependence, not as a guarantee that a single historical correlation summarizes future joint extremes.
## References
T. W. Anderson, *An Introduction to Multivariate Statistical Analysis*.
R. J. Muirhead, *Aspects of Multivariate Statistical Theory*.
R. A. Johnson and D. W. Wichern, *Applied Multivariate Statistical Analysis*.
G. A. F. Seber, *Multivariate Observations*.
R. B. Nelsen, *An Introduction to Copulas*.
A. J. McNeil, R. Frey, and P. Embrechts, *Quantitative Risk Management: Concepts, Techniques and Tools*.
Contents
- Introduction
- 1. Foundations of Multivariate Data
- What Changes in Several Dimensions?
- The Statistical Model Behind the Course
- Why the Multivariate Normal Distribution Is Central
- The Four Classical Techniques
- MANOVA as a Unifying Framework
- Recurring Mathematical Tools
- Reading the Course as a Whole
- 2. The Multivariate Normal Distribution
- Describing Joint Gaussian Variation
- Affine Transformations
- Marginal And Conditional Distributions
- Independence And Zero Covariance
- Moment Generating Functions And Quadratic Forms
- 3. Random Samples from the Multivariate Normal
- Estimating Location and Scatter from a Random Sample
- Orthogonal Decomposition of a Normal Sample
- Distribution of the Sample Mean
- Maximum Likelihood Estimation of the Normal Parameters
- Sufficiency of the Mean and Covariance
- 4. The Wishart Distribution
- From Normal Vectors to Random Covariance Matrices
- The Density and the Multivariate Gamma Function
- Sums, Blocks, and Marginals
- Bartlett Decomposition and Exact Simulation
- First and Second Moments
- Spectral Viewpoint and High-Dimensional Preview
- 5. Hotelling's $T^2$ and Inference on the Mean
- The One-Sample Mean Problem
- Deriving The F Reduction
- Comparing Two Multivariate Means
- Confidence Ellipsoids And Simultaneous Inference
- Power And Non-Centrality
- 6. Principal Component Analysis
- Population Principal Components
- Variance Explained and Dimension Reduction
- Covariance PCA and Correlation PCA
- Sample Principal Components
- Asymptotic Distribution of Sample Eigenvalues
- Choosing the Number of Components
- Loadings, Scores, and Biplots
- 7. Factor Analysis
- The Gaussian Factor Model
- Rotational Indeterminacy and Identification
- Communalities and Specific Variances
- Estimating the Factor Model
- Rotating Factors for Interpretability
- Factor Scores
- Factor Analysis and Principal Component Analysis
- 8. Canonical Correlation Analysis
- Partitioned Random Vectors and Cross-Covariance
- Population Canonical Correlations
- Canonical Variates and Orthogonality
- Sample Canonical Correlation Analysis
- Testing Cross-Block Independence
- Connections with PCA and Regression
- 9. Linear Discriminant Analysis
- Fisher Two-Group Linear Discriminant
- Bayes Classification Under Equal Covariances
- Quadratic Discriminant Analysis
- Multi-Group Linear Discriminant Analysis
- Sample LDA And Plug-In Classification
- Error Rates And Mahalanobis Distance
- Regularised Discriminant Analysis
- Summary Of The LDA Framework
- 10. Multivariate Analysis of Variance (MANOVA)
- The One-Way MANOVA Question
- Decomposing Multivariate Variation
- Characteristic Roots And Canonical Correlations
- The Four Classical MANOVA Statistics
- Null Distributions And Approximations
- Post-Hoc Analysis After A Global Rejection
- Robustness And Power Of The Four Tests
- Two-Way MANOVA And The General Linear Model
- Summary Of The MANOVA Workflow
- 11. The Multivariate Linear Model and Generalised Least Squares
- Why a Matrix Response Changes the Linear Model
- How the Least Squares Estimator Varies From Sample to Sample
- How Linear Hypotheses Become MANOVA Problems
- Why Generalised Least Squares Can Improve Seemingly Unrelated Regressions
- How To Predict A New Multivariate Observation
- What This Chapter Adds To The Course
- 12. High-Dimensional Asymptotics and Modern Extensions
- The Curse of Dimensionality for Covariance Matrices
- Spectral Limits and the Marchenko-Pastur Law
- Spikes and Detectable Principal Components
- Shrinkage and Regularised Covariance Estimation
- Sparse Principal Components
- Largest Eigenvalues and Tracy-Widom Fluctuations
- 13. Copulas and Dependence Beyond Normality
- Separating Margins from Dependence
- Parametric Copula Families
- Rank-Based Measures of Dependence
- Tail Dependence and Joint Extremes
- Estimating Copula Models
- Finance and the Gaussian Copula Critique
- References
Multivariate Statistical Analysis
Content
Problems
History
Created by admin on 5/28/2026 | Last updated on 6/1/2026
Prerequisites (0/78 completed)
Log in to track your prerequisite progress.
Prerequisites Graph
Interactive dependency map showing prerequisite concepts
Loading dependency graph...
Theorem
Definition
Current
Requires
Rate this page
★
★
★
★
★
Poor
Excellent