High-Dimensional Statistics II studies what can and cannot be learned from data when the number of parameters is comparable to, or larger than, the sample size. The course focuses on minimax theory, information-theoretic lower bounds, and random matrix methods as complementary tools for understanding the limits of estimation, testing, and recovery in high dimensions. Its central goal is to explain not only the best possible rates, but also why those rates arise, which structural assumptions matter, and where algorithmic performance can fall short of statistical possibility.
The chapters build in a deliberate sequence rather than as a catalogue of isolated results. The opening decision-theoretic material sets up the minimax framework, then Fano’s inequality and Assouad’s lemma develop the main lower-bound templates used throughout the course. These tools are applied to sparse linear models and compressed sensing before the focus shifts to random matrix preliminaries, where the spectral behavior of large sample covariance matrices is introduced. From there the course develops covariance estimation, PCA, and spiked covariance models, including the Marchenko-Pastur law and BBP transitions, and then moves to minimax testing, detection, and contiguity. The final chapters return to comparison and synthesis: they organize rates by the statistical obstruction responsible for them, separate information-theoretic limits from algorithmic phenomena, and use tables only as summaries of the preceding narrative.
# Introduction
This course studies the statistical limits that appear when dimension, sample size, sparsity, and noise scale together. A first course in high-dimensional statistics often emphasizes estimators: thresholding, convex regularisation, sparse regression, and concentration-based upper bounds. Here the central question is complementary: when no procedure can do better, what mathematical mechanism proves that impossibility? The answer combines statistical decision theory, information inequalities, metric entropy, random matrix theory, and the geometry of sparse structures.
The course has two intertwined threads. The first develops minimax lower bounds: reductions from estimation to testing, Le Cam's method, [Fano's inequality](/theorems/1654), [Assouad's lemma](/theorems/5906), and local asymptotic arguments. The second develops random matrix tools: spectral limits, covariance estimation, spiked models, phase transitions, and the probabilistic structure behind compressed sensing. The point of treating these together is that modern high-dimensional statistics often asks both for an optimal rate and for a spectral or geometric explanation of why that rate changes across regimes.
## What Makes a High-Dimensional Problem Hard?
The guiding problem is to distinguish computational difficulty, proof difficulty, and statistical difficulty. An estimator may be hard to compute, an analysis may require delicate inequalities, or the model may contain so little information that every estimator fails at a target accuracy. Minimax theory isolates the last issue by comparing all possible procedures against the worst parameter in a given model class.
Before the formal machinery begins, we need language that separates what is random from what is unknown. The observation model tells us which probability laws could have generated the data; the parameter space records the possible unknowns; and the loss function specifies which errors matter. In high dimension each of these may scale with $n$, so rates are statements about sequences of problems rather than a single fixed experiment. The first definition fixes the probability model on which all later risk comparisons are made.
[definition: Statistical Experiment]
A statistical experiment is a family of probability measures $(\nu_\theta)_{\theta \in \Theta}$ on a measurable sample space $(\mathcal X, \mathcal A)$, where $\Theta$ is the parameter space.
[/definition]
This definition separates the data-generating law from the unknown object being estimated. The next layer records what an estimator is allowed to output and how its accuracy is measured, because lower bounds must apply to every measurable rule with the same target.
[definition: Estimator And Risk]
Let $(\nu_\theta)_{\theta \in \Theta}$ be a statistical experiment, let $(\mathcal T, \rho)$ be a target space equipped with a metric $\rho$, and let $T: \Theta \to \mathcal T$ be the target functional. An estimator is a measurable map $\hat T: \mathcal X \to \mathcal T$. For a loss function $L: \mathcal T \times \mathcal T \to [0, \infty)$, the risk of $\hat T$ at $\theta$ is
\begin{align*}
R(\hat T, \theta) = \mathbb E_\theta[L(\hat T(X), T(\theta))].
\end{align*}
[/definition]
The expectation $\mathbb E_\theta$ is taken under $X \sim \nu_\theta$. Different choices of $L$ encode different statistical goals, and this is where high-dimensional problems begin to diverge.
[example: Three Losses For Sparse Estimation]
Let $S=\operatorname{supp}\theta$ and estimate the support by the coordinatewise threshold rule
\begin{align*}
\hat S_t=\{j: |X_j|>t\}.
\end{align*}
The three losses ask different questions about the same observation $X=\theta+\sigma Z$, where $Z\sim\mathcal N(0,I_d)$ has independent standard normal coordinates. Squared error measures
\begin{align*}
L_2(\hat\theta,\theta)=|\hat\theta-\theta|^2=\sum_{j=1}^d(\hat\theta_j-\theta_j)^2.
\end{align*}
Support recovery loss measures
\begin{align*}
L_{\mathrm{supp}}(\hat S,S)=\mathbb 1_{\{\hat S\ne S\}}.
\end{align*}
Hamming loss measures
\begin{align*}
L_{\mathrm{Ham}}(\hat S,S)=\sum_{j\notin S}\mathbb 1_{\{j\in \hat S\}}+\sum_{j\in S}\mathbb 1_{\{j\notin \hat S\}}.
\end{align*}
For the threshold rule, linearity of expectation gives
\begin{align*}
\mathbb E_\theta[L_{\mathrm{Ham}}(\hat S_t,S)]=\sum_{j\notin S}\mathbb P_\theta(|X_j|>t)+\sum_{j\in S}\mathbb P_\theta(|X_j|\le t).
\end{align*}
If $j\notin S$, then $\theta_j=0$ and $X_j=\sigma Z_j$, so
\begin{align*}
\mathbb P_\theta(|X_j|>t)=\mathbb P(|\sigma Z_j|>t)=\mathbb P(|Z_j|>t/\sigma).
\end{align*}
If $j\in S$ and $|\theta_j|\ge \mu$, put $a=|\theta_j|$ and $W_j=\operatorname{sign}(\theta_j)Z_j$, so that $|X_j|=|\theta_j+\sigma Z_j|=|a+\sigma W_j|$. Since $W_j\sim\mathcal N(0,1)$ by symmetry of the standard normal law, we have
\begin{align*}
\mathbb P_\theta(|X_j|\le t)=\mathbb P(|a+\sigma W_j|\le t).
\end{align*}
The event $|a+\sigma W_j|\le t$ is the same as $-t\le a+\sigma W_j\le t$, hence
\begin{align*}
\mathbb P(|a+\sigma W_j|\le t)=\mathbb P\left(\frac{-t-a}{\sigma}\le W_j\le \frac{t-a}{\sigma}\right).
\end{align*}
This event is contained in the event $W_j\le (t-a)/\sigma$, and $a\ge \mu$ gives $(t-a)/\sigma\le (t-\mu)/\sigma$, so
\begin{align*}
\mathbb P_\theta(|X_j|\le t)\le \mathbb P\left(W_j\le \frac{t-\mu}{\sigma}\right).
\end{align*}
Thus the Hamming risk is the sum of false positive probabilities over inactive coordinates and false negative probabilities over active coordinates. Exact support recovery is stricter, because
\begin{align*}
\{\hat S_t\ne S\}=\left(\bigcup_{j\notin S}\{|X_j|>t\}\right)\cup \left(\bigcup_{j\in S}\{|X_j|\le t\}\right).
\end{align*}
One coordinate error is enough to fail exact recovery, so the probability of the whole union must vanish. This requires controlling the false positive and false negative probabilities simultaneously over all inactive and active coordinates, not only on average. This is why squared estimation, average coordinate classification, and exact support recovery can have different signal thresholds in the same sparse normal means model.
[/example]
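The error probabilities in this example can be evaluated numerically. The following is a minimal sketch, assuming that the coordinates of $Z$ are independent standard normals and that all $s$ active coordinates share the same amplitude $a$; the function names and the parameter values in the usage lines are illustrative choices, not part of the course material.

```python
import math

def std_normal_cdf(x: float) -> float:
    """Standard normal CDF expressed through the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def false_positive_prob(t: float, sigma: float) -> float:
    """P(|sigma Z_j| > t) for an inactive coordinate (theta_j = 0)."""
    return math.erfc(t / (sigma * math.sqrt(2.0)))

def false_negative_prob(a: float, t: float, sigma: float) -> float:
    """P(|a + sigma W_j| <= t) for an active coordinate with |theta_j| = a."""
    return std_normal_cdf((t - a) / sigma) - std_normal_cdf((-t - a) / sigma)

def hamming_risk(d: int, s: int, a: float, t: float, sigma: float) -> float:
    """Expected Hamming loss: sum of per-coordinate error probabilities."""
    return (d - s) * false_positive_prob(t, sigma) + s * false_negative_prob(a, t, sigma)

def exact_recovery_failure(d: int, s: int, a: float, t: float, sigma: float) -> float:
    """P(S_hat != S). Assumes independent coordinates, so the probabilities
    of making no error on each coordinate multiply."""
    p_no_fp = (1.0 - false_positive_prob(t, sigma)) ** (d - s)
    p_no_fn = (1.0 - false_negative_prob(a, t, sigma)) ** s
    return 1.0 - p_no_fp * p_no_fn

# Illustrative (assumed) values: d = 1000, s = 10, amplitude 5, unit noise.
d, s, a, sigma = 1000, 10, 5.0, 1.0
t = math.sqrt(2.0 * math.log(d))
print("Hamming risk:", hamming_risk(d, s, a, t, sigma))
print("P(exact recovery fails):", exact_recovery_failure(d, s, a, t, sigma))
```

By the union bound, the failure probability of exact recovery never exceeds the Hamming risk, but it can be close to one even when the Hamming risk per coordinate is tiny.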
The example shows why a course on lower bounds cannot treat the loss as a secondary detail. To compare all estimators under a fixed loss and parameter class, we need a single benchmark that records the best possible worst-case performance; this is the minimax risk.
[definition: Minimax Risk]
For a statistical experiment $(\nu_\theta)_{\theta \in \Theta}$, target functional $T: \Theta \to \mathcal T$, and loss $L$, the minimax risk is
\begin{align*}
\mathcal R_n(\Theta, L) = \inf_{\hat T} \sup_{\theta \in \Theta} \mathbb E_\theta[L(\hat T(X), T(\theta))],
\end{align*}
where the infimum is over all measurable estimators $\hat T$.
[/definition]
The subscript $n$ indicates that the experiment usually depends on the sample size. A minimax upper bound constructs an estimator and controls its risk; a minimax lower bound proves that every estimator has risk at least a certain size.
## From Estimation To Testing
A lower bound becomes manageable when a continuum of possible parameters is replaced by a finite set of alternatives that remain hard to distinguish. The problem is then converted into multiple hypothesis testing: if an estimator were too accurate, it would identify which alternative generated the data. Information inequalities rule out that level of identification when the corresponding probability measures are close.
[definition: Packing]
Let $(\Theta, \rho)$ be a [metric space](/page/Metric%20Space). A subset $\{\theta_1, \dots, \theta_M\} \subset \Theta$ is an $\varepsilon$-packing if $\rho(\theta_i, \theta_j) \ge \varepsilon$ for all $i \ne j$.
[/definition]
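The packing condition is easy to check on a finite candidate set. The sketch below, with hypothetical helper names, verifies the definition in the Euclidean metric and builds a packing greedily from the vertices of the cube $\{0,1\}^3$. The greedy rule is only one convenient construction and is not claimed to produce a maximal packing.

```python
import itertools
import math

def euclidean(u, v):
    """Euclidean distance between two points given as sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def is_packing(points, eps, dist=euclidean):
    """True when every pair of distinct points is at distance >= eps."""
    return all(dist(p, q) >= eps for p, q in itertools.combinations(points, 2))

def greedy_packing(candidates, eps, dist=euclidean):
    """Scan the candidates in order and keep each one that is at least
    eps away from every point kept so far."""
    kept = []
    for c in candidates:
        if all(dist(c, k) >= eps for k in kept):
            kept.append(c)
    return kept

cube = list(itertools.product([0, 1], repeat=3))
packing = greedy_packing(cube, eps=1.4)
print(packing, is_packing(packing, 1.4))
```

With separation $1.4$, neighbouring vertices at distance $1$ are excluded, so the retained vertices are pairwise at distance at least $\sqrt 2$.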
Packings are the geometric input into testing reductions, but separation alone is not yet a lower bound. The obstruction is that an estimator returns a point in the parameter space, while the reduced problem asks for a discrete label. What is needed is a formal bridge: if an estimator had risk much smaller than the packing radius, then it would implicitly identify the correct packed point. The following result packages that bridge and makes clear why the metric structure and packing radius are part of the hypothesis.
[quotetheorem:5889]
[citeproof:5889]
This theorem is the main bridge between geometry and information, but its metric hypothesis is part of the statistical content rather than a technical ornament. For metric losses, small estimation error has an unambiguous meaning relative to a separated packing: by the triangle inequality, an estimate within half the packing radius of the true parameter is strictly closer to it than to any other packed point, so a nearest-point decoder recovers the correct label. For non-metric losses, that interpretation can fail, so the same lower-bound conclusion may require a problem-specific way to convert an estimator into a test. The next chapters repeatedly use this result with two points, hypercubes, Hamming packings, and metric entropy constructions.
[example: Bounded Gaussian Mean]
Consider $X \sim \mathcal N(\theta,\sigma^2 I_d)$ with $\theta \in B(0,r)$ and squared error loss. Take
\begin{align*}
\theta_1=0,\qquad \theta_2=h e_1,\qquad 0<h\le r.
\end{align*}
Then both parameters lie in $B(0,r)$, and their squared separation is
\begin{align*}
|\theta_2-\theta_1|^2=|h e_1|^2=h^2.
\end{align*}
Let $P_j=\mathcal N(\theta_j,\sigma^2 I_d)$. The likelihood ratio is
\begin{align*}
\log\frac{dP_1}{dP_2}(x)=-\frac{|x|^2}{2\sigma^2}+\frac{|x-h e_1|^2}{2\sigma^2}.
\end{align*}
Expanding the two squared norms gives
\begin{align*}
|x-h e_1|^2=(x_1-h)^2+\sum_{k=2}^d x_k^2=x_1^2-2h x_1+h^2+\sum_{k=2}^d x_k^2.
\end{align*}
Since $|x|^2=x_1^2+\sum_{k=2}^d x_k^2$, subtraction yields
\begin{align*}
|x-h e_1|^2-|x|^2=h^2-2h x_1.
\end{align*}
Therefore
\begin{align*}
\log\frac{dP_1}{dP_2}(x)=\frac{h^2-2h x_1}{2\sigma^2}.
\end{align*}
Under $P_1$, $X_1\sim \mathcal N(0,\sigma^2)$, so $\mathbb E_1[X_1]=0$. Hence
\begin{align*}
D_{\mathrm{KL}}(P_1\|P_2)=\mathbb E_1\left[\frac{h^2-2hX_1}{2\sigma^2}\right]=\frac{h^2-2h\mathbb E_1[X_1]}{2\sigma^2}=\frac{h^2}{2\sigma^2}.
\end{align*}
Given any estimator $\hat\theta$, form the decoder $\hat J$ that chooses the closer of $\theta_1$ and $\theta_2$, with deterministic tie-breaking. If the true parameter is $\theta_j$ and $|\hat\theta-\theta_j|<h/2$, then the triangle inequality gives
\begin{align*}
|\hat\theta-\theta_{3-j}|\ge |\theta_1-\theta_2|-|\hat\theta-\theta_j|>h-\frac h2=\frac h2.
\end{align*}
Thus the decoder is correct on this event, so decoder error implies $|\hat\theta-\theta_j|\ge h/2$. Consequently
\begin{align*}
\mathbb E_j|\hat\theta-\theta_j|^2\ge \frac{h^2}{4}\mathbb P_j(\hat J\ne j).
\end{align*}
The two-point total-variation testing bound gives
\begin{align*}
\inf_{\hat J}\max_{j=1,2}\mathbb P_j(\hat J\ne j)\ge \frac{1-\|P_1-P_2\|_{\mathrm{TV}}}{2}.
\end{align*}
By *[Pinsker Inequality](/theorems/5890)* and the KL computation above,
\begin{align*}
\|P_1-P_2\|_{\mathrm{TV}}\le \sqrt{\frac{1}{2}D_{\mathrm{KL}}(P_1\|P_2)}=\sqrt{\frac{h^2}{4\sigma^2}}=\frac{h}{2\sigma}.
\end{align*}
Choosing $h=\min\{r,\sigma\}$ gives $h/(2\sigma)\le 1/2$, hence
\begin{align*}
\inf_{\hat J}\max_{j=1,2}\mathbb P_j(\hat J\ne j)\ge \frac{1-1/2}{2}=\frac14.
\end{align*}
Combining the estimation-to-testing implication with this testing lower bound gives
\begin{align*}
\inf_{\hat\theta}\sup_{\theta\in B(0,r)}\mathbb E_\theta|\hat\theta-\theta|^2\ge \frac{h^2}{4}\cdot \frac14=\frac{1}{16}\min\{r^2,\sigma^2\}.
\end{align*}
This proves a one-coordinate lower bound: even before using the remaining $d-1$ directions, squared-error estimation over the ball costs at least a constant multiple of $\min\{r^2,\sigma^2\}$. A full $d$-dimensional packing repeats this separation across many directions to obtain the dimension-dependent rate.
[/example]
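The chain of inequalities in the bounded Gaussian mean example can be traced numerically. The sketch below follows the same steps (Kullback-Leibler divergence, Pinsker, two-point testing bound, estimation-to-testing reduction) with the choice $h=\min\{r,\sigma\}$; the function name is hypothetical.

```python
import math

def two_point_lower_bound(r: float, sigma: float) -> float:
    """Two-point lower bound on the minimax squared-error risk over B(0, r)."""
    h = min(r, sigma)                       # separation between theta_1 and theta_2
    kl = h ** 2 / (2.0 * sigma ** 2)        # KL(P_1 || P_2) for Gaussian shifts
    tv_bound = math.sqrt(kl / 2.0)          # Pinsker: TV <= h / (2 sigma)
    testing_error = (1.0 - tv_bound) / 2.0  # lower bound on the worst testing error
    return (h ** 2 / 4.0) * testing_error   # estimation-to-testing reduction

for r, sigma in [(1.0, 2.0), (3.0, 1.0), (0.5, 0.5)]:
    bound = two_point_lower_bound(r, sigma)
    print(r, sigma, bound, min(r, sigma) ** 2 / 16.0)
```

When $r<\sigma$ the Pinsker bound is strictly below $1/2$, so the computed value is slightly larger than $\min\{r^2,\sigma^2\}/16$; the constant $1/16$ in the example is the uniform guarantee.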
The bounded mean example is intentionally simple: it shows the mechanism before sparse combinatorics enters. In sparse normal means, the packing must also encode support choices, which is where Gilbert-Varshamov-type binary packings enter the course.
## Information Inequalities As Lower-Bound Engines
Once estimation has been reduced to testing, the remaining question is probabilistic: how close are the induced laws? The course uses several notions of discrepancy between probability measures, each suited to a different lower-bound argument. Total variation controls binary testing directly; Kullback-Leibler divergence tensorizes well; chi-squared divergence is useful for mixtures and second-moment arguments.
[definition: Total Variation And Kullback-Leibler Divergence]
Let $P$ and $Q$ be probability measures on $(\mathcal X, \mathcal A)$. The total variation distance is
\begin{align*}
\|P - Q\|_{\mathrm{TV}} = \sup_{A \in \mathcal A} |P(A) - Q(A)|.
\end{align*}
If $P \ll Q$, the Kullback-Leibler divergence is
\begin{align*}
D_{\mathrm{KL}}(P\|Q) = \int_{\mathcal X} \log\left(\frac{dP}{dQ}\right)\,dP.
\end{align*}
[/definition]
These quantities are not interchangeable, but inequalities between them allow a bound in one divergence to imply a testing lower bound. The practical tension is that Gaussian and product models usually give accessible Kullback-Leibler calculations, while the testing reductions above are stated in total variation. The next result is the standard conversion that lets a computable information quantity certify indistinguishability.
[quotetheorem:5890]
[citeproof:5890]
Pinsker's inequality is powerful because Kullback-Leibler divergence often has a closed form in Gaussian models. For product experiments, it also adds across independent samples, matching the way information accumulates with $n$. The absolute-continuity hypothesis marks the boundary of this usefulness: when $P$ puts mass where $Q$ puts none, the Kullback-Leibler divergence is infinite and no finite closeness conclusion follows. Pinsker is also one-sided as a lower-bound tool: small KL divergence forces small total variation, but large or asymmetric KL divergence does not by itself give a sharp testing statement, and in mixture problems chi-squared or Hellinger comparisons may be more informative.
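For distributions on a finite set both divergences reduce to finite sums, which makes Pinsker's inequality easy to check directly. The sketch below uses the standard fact that on a finite sample space the total variation distance equals half the $\ell_1$ distance between the probability mass functions; the function names and the two example distributions are illustrative assumptions.

```python
import math

def total_variation(p, q):
    """TV distance between two pmfs on the same finite set (half the L1 distance)."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

def kl_divergence(p, q):
    """KL(P || Q) in nats; returns infinity when P is not dominated by Q."""
    total = 0.0
    for a, b in zip(p, q):
        if a == 0.0:
            continue            # terms with P-mass zero contribute nothing
        if b == 0.0:
            return math.inf     # P puts mass where Q puts none
        total += a * math.log(a / b)
    return total

def pinsker_bound(p, q):
    """Upper bound on the TV distance given by Pinsker's inequality."""
    return math.sqrt(0.5 * kl_divergence(p, q))

p = [0.5, 0.3, 0.2]
q = [0.4, 0.4, 0.2]
print(total_variation(p, q), pinsker_bound(p, q))
print(kl_divergence([1.0, 0.0], [0.0, 1.0]))  # infinite: no absolute continuity
```

The last line illustrates the boundary case described above: without absolute continuity the divergence is infinite and the bound carries no information.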
[example: KL Divergence Between Gaussian Shifts]
Let $P_\theta = \mathcal N(\theta, \sigma^2 I_d)$ and $P_{\theta'} = \mathcal N(\theta', \sigma^2 I_d)$, where $\sigma>0$. We compute $D_{\mathrm{KL}}(P_\theta\|P_{\theta'})$ from the Gaussian densities
\begin{align*}
p_\theta(x)=\frac{1}{(2\pi\sigma^2)^{d/2}}\exp\left(-\frac{|x-\theta|^2}{2\sigma^2}\right)
\end{align*}
and
\begin{align*}
p_{\theta'}(x)=\frac{1}{(2\pi\sigma^2)^{d/2}}\exp\left(-\frac{|x-\theta'|^2}{2\sigma^2}\right).
\end{align*}
The normalizing constants cancel in the likelihood ratio, so for every $x\in\mathbb R^d$,
\begin{align*}
\log \frac{p_\theta(x)}{p_{\theta'}(x)}=-\frac{|x-\theta|^2}{2\sigma^2}+\frac{|x-\theta'|^2}{2\sigma^2}.
\end{align*}
Equivalently,
\begin{align*}
\log \frac{p_\theta(x)}{p_{\theta'}(x)}=\frac{|x-\theta'|^2-|x-\theta|^2}{2\sigma^2}.
\end{align*}
Expanding the first squared norm gives
\begin{align*}
|x-\theta'|^2=(x-\theta')\cdot(x-\theta')=|x|^2-2x\cdot\theta'+|\theta'|^2.
\end{align*}
Expanding the second squared norm gives
\begin{align*}
|x-\theta|^2=(x-\theta)\cdot(x-\theta)=|x|^2-2x\cdot\theta+|\theta|^2.
\end{align*}
Subtracting these two identities,
\begin{align*}
|x-\theta'|^2-|x-\theta|^2=\left(|x|^2-2x\cdot\theta'+|\theta'|^2\right)-\left(|x|^2-2x\cdot\theta+|\theta|^2\right).
\end{align*}
The $|x|^2$ terms cancel, leaving
\begin{align*}
|x-\theta'|^2-|x-\theta|^2=2x\cdot(\theta-\theta')+|\theta'|^2-|\theta|^2.
\end{align*}
By definition of the Kullback-Leibler divergence, with $X\sim P_\theta$,
\begin{align*}
D_{\mathrm{KL}}(P_\theta\|P_{\theta'})=\mathbb E_\theta\left[\log \frac{p_\theta(X)}{p_{\theta'}(X)}\right].
\end{align*}
Substituting the likelihood-ratio formula,
\begin{align*}
D_{\mathrm{KL}}(P_\theta\|P_{\theta'})=\frac{1}{2\sigma^2}\mathbb E_\theta\left[2X\cdot(\theta-\theta')+|\theta'|^2-|\theta|^2\right].
\end{align*}
By linearity of expectation,
\begin{align*}
D_{\mathrm{KL}}(P_\theta\|P_{\theta'})=\frac{1}{2\sigma^2}\left(2\mathbb E_\theta[X]\cdot(\theta-\theta')+|\theta'|^2-|\theta|^2\right).
\end{align*}
Since $X\sim\mathcal N(\theta,\sigma^2 I_d)$, $\mathbb E_\theta[X]=\theta$, and hence
\begin{align*}
D_{\mathrm{KL}}(P_\theta\|P_{\theta'})=\frac{1}{2\sigma^2}\left(2\theta\cdot(\theta-\theta')+|\theta'|^2-|\theta|^2\right).
\end{align*}
Expanding the remaining [inner product](/page/Inner%20Product),
\begin{align*}
2\theta\cdot(\theta-\theta')+|\theta'|^2-|\theta|^2=2|\theta|^2-2\theta\cdot\theta'+|\theta'|^2-|\theta|^2.
\end{align*}
Combining like terms,
\begin{align*}
2|\theta|^2-2\theta\cdot\theta'+|\theta'|^2-|\theta|^2=|\theta|^2-2\theta\cdot\theta'+|\theta'|^2.
\end{align*}
Finally,
\begin{align*}
|\theta|^2-2\theta\cdot\theta'+|\theta'|^2=|\theta-\theta'|^2.
\end{align*}
Thus
\begin{align*}
D_{\mathrm{KL}}(P_\theta\|P_{\theta'})=\frac{|\theta-\theta'|^2}{2\sigma^2}.
\end{align*}
For $n$ independent samples $X_1,\dots,X_n$, the joint density under parameter $\theta$ is $\prod_{i=1}^n p_\theta(X_i)$, so
\begin{align*}
\log \frac{\prod_{i=1}^n p_\theta(X_i)}{\prod_{i=1}^n p_{\theta'}(X_i)}=\sum_{i=1}^n \log\frac{p_\theta(X_i)}{p_{\theta'}(X_i)}.
\end{align*}
Taking expectation under $P_\theta^{\otimes n}$ and using linearity of expectation,
\begin{align*}
D_{\mathrm{KL}}(P_\theta^{\otimes n}\|P_{\theta'}^{\otimes n})=\sum_{i=1}^n D_{\mathrm{KL}}(P_\theta\|P_{\theta'}).
\end{align*}
Therefore
\begin{align*}
D_{\mathrm{KL}}(P_\theta^{\otimes n}\|P_{\theta'}^{\otimes n})=\frac{n|\theta-\theta'|^2}{2\sigma^2}.
\end{align*}
Euclidean separation is therefore exactly the scale of statistical distinguishability in this normal mean model, and independent samples add information linearly in $n$.
[/example]
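The closed form can be compared with a Monte Carlo average of the log-likelihood ratio. This is a hedged sketch assuming NumPy is available; the sample size, seed, and parameter vectors are arbitrary illustrative choices.

```python
import numpy as np

def kl_gaussian_shift(theta, theta_prime, sigma, n=1):
    """Closed form n |theta - theta'|^2 / (2 sigma^2) for n independent samples."""
    diff = np.asarray(theta, dtype=float) - np.asarray(theta_prime, dtype=float)
    return n * float(diff @ diff) / (2.0 * sigma ** 2)

def kl_monte_carlo(theta, theta_prime, sigma, num_draws=200_000, seed=0):
    """Average of log(p_theta(X) / p_theta'(X)) over draws X ~ N(theta, sigma^2 I)."""
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta, dtype=float)
    theta_prime = np.asarray(theta_prime, dtype=float)
    x = theta + sigma * rng.standard_normal((num_draws, theta.size))
    sq_to_prime = np.sum((x - theta_prime) ** 2, axis=1)
    sq_to_theta = np.sum((x - theta) ** 2, axis=1)
    log_ratio = (sq_to_prime - sq_to_theta) / (2.0 * sigma ** 2)
    return float(log_ratio.mean())

theta = [1.0, 0.0, -1.0]
theta_prime = [0.0, 0.5, 0.0]
exact = kl_gaussian_shift(theta, theta_prime, sigma=1.5)
estimate = kl_monte_carlo(theta, theta_prime, sigma=1.5)
print(exact, estimate)
```

Passing `n` to `kl_gaussian_shift` reproduces the tensorization identity: the divergence for a product experiment is `n` times the single-sample value.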
The same logic underlies covariance estimation and regression, but the algebra becomes matrix-valued. This motivates the random matrix half of the course.
## Random Matrices And Spectral Phenomena
Many high-dimensional estimators are functions of empirical covariance matrices, Gram matrices, or noise matrices. In low-dimensional asymptotics these matrices converge to their population counterparts in operator norm, but this need not happen when $d/n$ has a non-zero limit. Random matrix theory gives the replacement deterministic objects and the fluctuation scales needed for statistical conclusions.
[definition: Sample Covariance Matrix]
Let $X_1, \dots, X_n$ be independent random vectors in $\mathbb R^d$ with $\mathbb E[X_i] = 0$ and covariance matrix $\Sigma = \mathbb E[X_i X_i^\top]$. The sample covariance matrix is
\begin{align*}
\hat\Sigma = \frac{1}{n}\sum_{i=1}^{n} X_i X_i^\top.
\end{align*}
[/definition]
The central issue is not only entrywise convergence of $\hat\Sigma$ but also spectral convergence. Operator norm errors control principal components, covariance regularisation, and many prediction guarantees.
[example: Isotropic Gaussian Sample Covariance]
Let $X_1,\dots,X_n\sim \mathcal N(0,I_d)$ be independent and let
\begin{align*}
\hat\Sigma=\frac1n\sum_{i=1}^n X_iX_i^\top .
\end{align*}
Each coordinate has mean $0$, different coordinates are uncorrelated, and each coordinate has variance $1$, so
\begin{align*}
\mathbb E[X_iX_i^\top]=I_d.
\end{align*}
Therefore
\begin{align*}
\mathbb E[\hat\Sigma]
=\mathbb E\left[\frac1n\sum_{i=1}^n X_iX_i^\top\right]
=\frac1n\sum_{i=1}^n \mathbb E[X_iX_i^\top]
=\frac1n\sum_{i=1}^n I_d
=I_d.
\end{align*}
This identity says that $\hat\Sigma$ is entrywise centered around the population covariance, but it does not imply that all eigenvalues are close to $1$ when $d$ is comparable to $n$.
Assume $d/n\to \gamma\in(0,\infty)$. By the *[Marchenko-Pastur Theorem](/theorems/4070)*, the empirical spectral distribution of $\hat\Sigma$ converges to the Marchenko-Pastur law with endpoints
\begin{align*}
a_\gamma=(1-\sqrt{\gamma})^2,
\qquad
b_\gamma=(1+\sqrt{\gamma})^2.
\end{align*}
For $0<\gamma\le 1$, the limiting eigenvalue mass lies on
\begin{align*}
[a_\gamma,b_\gamma]
=
[(1-\sqrt{\gamma})^2,(1+\sqrt{\gamma})^2].
\end{align*}
The upper edge differs from the population eigenvalue $1$ by
\begin{align*}
b_\gamma-1
=(1+\sqrt{\gamma})^2-1
=1+2\sqrt{\gamma}+\gamma-1
=2\sqrt{\gamma}+\gamma,
\end{align*}
which is positive for every $\gamma>0$. For example, if $\gamma=1$, then
\begin{align*}
a_1=(1-1)^2=0,
\qquad
b_1=(1+1)^2=4,
\end{align*}
so the top eigenvalues live near $4$, not near $1$.
If $\gamma>1$, then eventually $d>n$, and the matrix $\hat\Sigma$ has rank at most $n$ because it is a sum of $n$ rank-one matrices:
\begin{align*}
\operatorname{rank}(\hat\Sigma)
=
\operatorname{rank}\left(\frac1n\sum_{i=1}^n X_iX_i^\top\right)
\le n.
\end{align*}
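Hence at least $d-n$ of its $d$ eigenvalues are exactly $0$ (almost surely exactly $d-n$, since the Gaussian vectors $X_1,\dots,X_n$ are linearly independent with probability one), and the fraction of zero eigenvalues satisfies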
\begin{align*}
\frac{d-n}{d}
=1-\frac nd
\to 1-\frac1\gamma.
\end{align*}
Thus the limiting law has an atom of mass $1-1/\gamma$ at $0$ when $\gamma>1$.
The sample covariance is unbiased, but in proportional dimension its spectrum is spread over the Marchenko-Pastur scale, with largest eigenvalue tending to $(1+\sqrt{\gamma})^2$ rather than to $1$.
[/example]
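A small simulation makes the spectral spread visible at finite size. The sketch below assumes NumPy; the dimensions, seed, and the numerical tolerance used to count zero eigenvalues are illustrative choices, and agreement with the limiting edges is only approximate at finite $n$.

```python
import numpy as np

def mp_edges(gamma):
    """Endpoints (a_gamma, b_gamma) of the Marchenko-Pastur bulk."""
    return (1.0 - np.sqrt(gamma)) ** 2, (1.0 + np.sqrt(gamma)) ** 2

def sample_covariance_eigenvalues(n, d, seed=0):
    """Eigenvalues (ascending) of the sample covariance of n draws from N(0, I_d)."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((n, d))   # row i is the observation X_i
    sigma_hat = x.T @ x / n
    return np.linalg.eigvalsh(sigma_hat)

# Case gamma = 1/2: eigenvalues spread over roughly [a, b] although E[Sigma_hat] = I.
eigs = sample_covariance_eigenvalues(n=400, d=200)
a, b = mp_edges(200 / 400)
print("smallest, largest:", eigs[0], eigs[-1], "edges:", a, b)

# Case gamma = 3: the rank bound forces d - n zero eigenvalues.
eigs_wide = sample_covariance_eigenvalues(n=100, d=300)
num_zero = int(np.sum(eigs_wide < 1e-8))  # numerical tolerance for "zero"
print("zero eigenvalues:", num_zero, "of", eigs_wide.size)
```

The average of the eigenvalues stays close to $1$, in line with unbiasedness, while the extreme eigenvalues sit near the bulk edges.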
This phenomenon explains why random matrices are not an optional technical appendix. To define a concrete model for low-rank signal hidden inside high-dimensional spectral noise, the course introduces the following spiked covariance model.
[definition: Spiked Covariance Model]
In the rank-one spiked covariance model, observations $X_1, \dots, X_n \in \mathbb R^d$ are independent with
\begin{align*}
X_i \sim \mathcal N(0, \Sigma), \qquad \Sigma = I_d + \lambda v v^\top,
\end{align*}
where $\lambda > 0$ and $v \in \mathbb R^d$ satisfies $|v| = 1$.
[/definition]
The model asks whether the leading empirical eigenvector reveals $v$ and whether the leading eigenvalue separates from the noise bulk. In proportional asymptotics, both questions have sharp threshold behaviour.
[quotetheorem:5891]
This result is stated here as a course landmark rather than a first-week theorem. Its assumptions are part of the message: the clean threshold uses Gaussian observations, a rank-one spike, and proportional asymptotics with $d/n$ approaching a positive constant. Eigenvalue separation is also not the same as full statistical recovery; it describes the behaviour of the leading empirical eigenpair, while other procedures or stronger structural assumptions may lead to different detection and estimation questions. Chapter 9 proves the precise BBP transition, after Chapters 7 and 8 develop the Marchenko-Pastur edge and PCA perturbation tools, and connects it to the minimax detection ideas developed in Chapter 10.
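The qualitative contrast between a strong and a weak spike can be seen in simulation before the precise threshold is proved. The sketch below assumes NumPy, fixes the spike direction as the first coordinate vector purely for convenience, and uses two arbitrary spike strengths; it illustrates the model and makes no claim about the exact location of the transition.

```python
import numpy as np

def spiked_sample(n, d, lam, seed=0):
    """Draw n observations from N(0, I_d + lam * v v^T) with v = e_1 (assumed)."""
    rng = np.random.default_rng(seed)
    v = np.zeros(d)
    v[0] = 1.0
    z = rng.standard_normal((n, d))   # isotropic noise part
    g = rng.standard_normal(n)        # one scalar factor per observation
    x = z + np.sqrt(lam) * g[:, None] * v
    return x, v

def top_eigenpair(x):
    """Largest eigenvalue and unit eigenvector of the sample covariance."""
    n = x.shape[0]
    vals, vecs = np.linalg.eigh(x.T @ x / n)
    return float(vals[-1]), vecs[:, -1]

n, d = 400, 200
bulk_edge = (1.0 + np.sqrt(d / n)) ** 2   # Marchenko-Pastur upper edge

x_strong, v = spiked_sample(n, d, lam=4.0)
top_strong, vec_strong = top_eigenpair(x_strong)
overlap_strong = float(vec_strong @ v) ** 2

x_weak, _ = spiked_sample(n, d, lam=0.1)
top_weak, vec_weak = top_eigenpair(x_weak)
overlap_weak = float(vec_weak @ v) ** 2

print("strong spike:", top_strong, overlap_strong)
print("weak spike:  ", top_weak, overlap_weak, "bulk edge:", bulk_edge)
```

In runs of this kind the strong spike produces a leading eigenvalue well above the bulk edge and an eigenvector aligned with $v$, while the weak spike leaves the leading eigenvalue near the edge and the eigenvector nearly uninformative.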
## Thresholds, Rates, And The Shape Of The Course
The course repeatedly returns to a common pattern: construct an estimator, prove it achieves a rate, then prove no estimator improves that rate over the same parameter class. When the two bounds match up to constants or logarithmic factors, the course identifies the statistical scale of the problem. When they do not match, the gap often points to missing geometry, computational constraints, or a more refined random matrix phenomenon.
[explanation: Main Course Themes]
The first part of the course builds lower-bound tools. Le Cam's method handles two carefully chosen alternatives, Fano's inequality handles large packings, and Assouad's lemma handles hypercube structures where many local testing problems combine into a global estimation lower bound.
The middle part applies these tools to canonical high-dimensional models: Gaussian sequence models, sparse normal means, covariance estimation, linear regression, and matrix estimation. The recurring question is how dimension $d$, sample size $n$, sparsity $s$, rank $r$, and noise level $\sigma$ enter the minimax risk.
The final part develops random matrix theory for statistical problems. Marchenko-Pastur limits, spectral norm bounds, spiked models, restricted isometry phenomena, and compressed sensing thresholds explain why convex and spectral procedures succeed in some regimes and fail in others.
[/explanation]
The prerequisite material is used throughout rather than reviewed in isolation. Measure-theoretic probability supplies the language of experiments and expectations, concentration inequalities control upper bounds and spectral deviations, linear algebra supplies eigenvalue and singular value tools, and decision theory gives the minimax framework.
[remark: Conventions For Asymptotic Statements]
Unless stated otherwise, constants may depend on fixed distributional or geometric parameters but not on $n$, $d$, $s$, $r$, or the noise level. Statements such as $a_n \lesssim b_n$ are uniform over the parameter class under discussion, with the relevant dependence stated near the result. Probability bounds are interpreted under the data-generating law specified by the current model.
[/remark]
These notes are written to make lower bounds and random matrix calculations usable rather than mysterious. Each major method will be introduced first through a small model where the calculation can be seen directly, then extended to the high-dimensional setting where geometry and probability interact.
The introduction has now set the stage by explaining why lower bounds and random matrix calculations are central tools rather than isolated tricks. The next chapter turns that motivation into a formal decision-theoretic language for measuring what can and cannot be recovered as dimension grows.
# 1. Statistical Decision Theory in High Dimension
This opening chapter fixes the language in which high-dimensional minimax theory is stated. The central question is how accurately an unknown parameter can be recovered when its dimension, sparsity, or ambient matrix size is allowed to grow with the sample size. The prerequisites are basic probability, expectation and variance, Gaussian random vectors, elementary metric-space language, and finite-dimensional linear algebra. We begin with statistical decision problems, then turn to finite metric constructions that convert estimation into testing, and finish with benchmark models that will reappear throughout the course.
## Loss, Risk, and Minimax Formulation
What does it mean for an estimator to be optimal when the parameter space contains many possible high-dimensional signals? Pointwise accuracy is too weak, because an estimator may perform well at a convenient parameter and poorly elsewhere. The minimax viewpoint asks for a procedure whose worst-case expected loss is as small as possible over the whole parameter class.
A statistical model specifies a family of probability measures indexed by the unknown parameter.
[definition: Statistical Model]
Let $\Theta$ be a parameter space and let $(\mathcal X, \mathcal A)$ be a measurable sample space. A statistical model is a family $\{\mathbb P_\theta : \theta \in \Theta\}$ of probability measures on $(\mathcal X, \mathcal A)$. The observed data $X$ have distribution $\mathbb P_\theta$ for an unknown $\theta \in \Theta$.
[/definition]
The parameter space is part of the statistical problem, not only a technical domain. In high dimension, $\Theta$ often encodes structure such as bounded Euclidean norm, sparsity, low rank, or positive-definiteness. The following example gives the model that will serve as the basic test case for the rest of the chapter.
[example: Gaussian Mean Model]
Let $X \sim \mathcal N(\theta,\sigma^2 I_d)$ with known $\sigma>0$ and unknown $\theta\in\Theta\subset\mathbb R^d$. Equivalently,
\begin{align*}
X=\theta+\sigma Z
\end{align*}
where $Z\sim\mathcal N(0,I_d)$. Therefore $\mathbb E_\theta[X]=\theta$ and $\operatorname{Cov}_\theta(X)=\sigma^2 I_d$.
The same Gaussian observation model leads to different statistical problems depending on $\Theta$. If $\Theta=\mathbb R^d$, then every vector in $\mathbb R^d$ is allowed and no structural constraint is imposed. If
\begin{align*}
\Theta=\{\theta\in\mathbb R^d:|\theta|\le R\},
\end{align*}
then every admissible signal satisfies
\begin{align*}
|\theta|^2=\sum_{j=1}^d \theta_j^2\le R^2,
\end{align*}
so the signal energy is bounded. If
\begin{align*}
\Theta=\{\theta\in\mathbb R^d:|\{j:\theta_j\ne0\}|\le s\},
\end{align*}
then at most $s$ coordinates of $\theta$ are nonzero, so the signal is $s$-sparse. Thus the distributional noise level is fixed by $\sigma^2 I_d$, while the parameter space determines whether the problem is unrestricted, energy-bounded, or sparse.
[/example]
The example shows that the same observation model can support several inferential goals. Estimating the whole vector, predicting a linear response, and identifying the nonzero coordinates are different tasks.
This creates a bookkeeping problem that the model alone cannot solve. Before comparing procedures, we must specify both the kind of object a procedure is allowed to output and the numerical penalty assigned to each possible error. The following definition separates these two pieces of structure from the probability model itself.
[definition: Estimator and Loss]
Let $\{\mathbb P_\theta : \theta \in \Theta\}$ be a statistical model and let $(\mathcal T, \mathcal G)$ be an action space. An estimator is a measurable map $\hat{\theta}: \mathcal X \to \mathcal T$. A loss function is a measurable map $L: \Theta \times \mathcal T \to [0, \infty)$.
[/definition]
For numerical parameters the action space is usually the same as the parameter space, but this is not required. In support recovery, for instance, the action is a subset of coordinates rather than a vector of real amplitudes. We therefore need a small catalogue of losses that match the most common high-dimensional objectives.
[definition: Common High-Dimensional Losses]
The squared error loss on $\mathbb R^d$ is the map
\begin{align*}
L_2: \mathbb R^d \times \mathbb R^d &\to [0,\infty), & L_2(\theta,\hat\theta) &= |\hat\theta-\theta|^2.
\end{align*}
For a design matrix $A\in\mathbb R^{n\times d}$, the prediction loss is the map
\begin{align*}
L_A: \mathbb R^d \times \mathbb R^d &\to [0,\infty), & L_A(\theta,\hat\theta) &= \frac{1}{n}|A(\hat\theta-\theta)|^2.
\end{align*}
The Hamming loss is the map
\begin{align*}
d_H: \{0,1\}^d \times \{0,1\}^d &\to \{0,1,\dots,d\}, & d_H(u,v) &= \sum_{j=1}^d \mathbb{1}_{\{u_j\ne v_j\}}.
\end{align*}
The exact support recovery loss is the map
\begin{align*}
L_{\mathrm{exact}}: \mathcal P(\{1,\dots,d\})\times \mathcal P(\{1,\dots,d\}) &\to \{0,1\}, & L_{\mathrm{exact}}(S,T) &= \mathbb{1}_{\{S\ne T\}}.
\end{align*}
The symmetric-difference support loss is the map
\begin{align*}
L_\triangle: \mathcal P(\{1,\dots,d\})\times \mathcal P(\{1,\dots,d\}) &\to \{0,1,\dots,d\}, & L_\triangle(S,T) &= |S\triangle T|.
\end{align*}
[/definition]
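The losses in this catalogue are short to implement, and doing so shows how they can disagree on the same estimate. The sketch below uses plain Python with hypothetical function names and zero-based coordinate indices; the numerical vectors are illustrative.

```python
def squared_error(theta, theta_hat):
    """L_2(theta, theta_hat) = |theta_hat - theta|^2."""
    return sum((b - a) ** 2 for a, b in zip(theta, theta_hat))

def prediction_loss(A, theta, theta_hat):
    """L_A(theta, theta_hat) = |A (theta_hat - theta)|^2 / n, A given as a list of rows."""
    diff = [b - a for a, b in zip(theta, theta_hat)]
    residual = [sum(row_j * diff_j for row_j, diff_j in zip(row, diff)) for row in A]
    return sum(r ** 2 for r in residual) / len(A)

def hamming(u, v):
    """d_H(u, v): number of coordinates where two binary vectors differ."""
    return sum(1 for a, b in zip(u, v) if a != b)

def exact_support_loss(S, T):
    """L_exact(S, T): 1 if the two supports differ, else 0."""
    return int(set(S) != set(T))

def symmetric_difference_loss(S, T):
    """L_triangle(S, T) = |S symmetric-difference T|."""
    return len(set(S) ^ set(T))

def support(theta):
    """Set of (zero-based) indices of nonzero coordinates."""
    return {j for j, x in enumerate(theta) if x != 0}

theta = [3.0, 0.0, 0.0, 0.0]
theta_hat = [3.0, 0.01, 0.01, 0.01]
print(squared_error(theta, theta_hat))                                   # tiny
print(symmetric_difference_loss(support(theta), support(theta_hat)))    # three support errors
print(exact_support_loss(support(theta), support(theta_hat)))           # exact recovery fails
```

The estimate is almost perfect in squared error yet wrong on three of four coordinates as a support estimate, which is the kind of disagreement between losses that the definition is meant to keep visible.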
These losses separate estimation, prediction, and model-selection goals. A rate under squared error need not imply a useful support recovery guarantee, because a small Euclidean error can be spread over many coordinates, each of which may be misclassified as active or inactive. To compare procedures under any one of these losses, the expected loss is measured uniformly over the parameter space.
[definition: Risk and Minimax Risk]
Let $\{\mathbb P_\theta : \theta \in \Theta\}$ be a statistical model on $(\mathcal X,\mathcal A)$, let $(\mathcal T,\mathcal G)$ be an action space, let $\hat\theta:\mathcal X\to\mathcal T$ be a measurable estimator, and let $L:\Theta\times\mathcal T\to[0,\infty)$ be a measurable loss. The risk of $\hat\theta$ is the map $R_{\hat\theta}:\Theta\to[0,\infty]$ defined by
\begin{align*}
R_{\hat\theta}(\theta) = \mathbb E_\theta[L(\theta, \hat\theta(X))].
\end{align*}
The worst-case risk functional over $\Theta$ is the map from measurable estimators $\hat\theta:\mathcal X\to\mathcal T$ to $[0,\infty]$ defined by
\begin{align*}
R_\Theta(\hat\theta) = \sup_{\theta \in \Theta} R_{\hat\theta}(\theta).
\end{align*}
The minimax risk functional assigns to each model, parameter space, action space, and loss a number in $[0,\infty]$:
\begin{align*}
\mathfrak M(\Theta, L) = \inf_{\hat\theta} \sup_{\theta \in \Theta} \mathbb E_\theta[L(\theta, \hat\theta(X))],
\end{align*}
where the infimum is over all measurable estimators with values in the action space.
[/definition]
The minimax risk is the benchmark against which concrete estimators are judged. Upper bounds come from constructing estimators; lower bounds come from proving that no estimator can do better. The notation also keeps track of which loss is being studied.
[remark: Dependence on the Loss]
The notation $\mathfrak M(\Theta, L)$ suppresses the statistical model. In these notes the model will be clear from context, but the loss remains visible because the same parameter space can have different minimax rates under different losses.
[/remark]
A first calibration is the unrestricted Gaussian mean problem. It has no structural constraint, so the only obstruction is Gaussian noise in each coordinate. This theorem sets the baseline against which all later high-dimensional improvements are measured.
[quotetheorem:5892]
[citeproof:5892]
This result identifies the noise level that every structured example must improve upon. Each hypothesis in the theorem is doing work. The unrestricted parameter space prevents shrinkage from improving the worst-case risk; on a bounded ball, the constant estimator already gives risk at most the squared radius. The known Gaussian variance makes the coordinatewise noise scale fixed; with unknown variance or non-Gaussian noise, the exact constant $d\sigma^2$ need not be the right benchmark. Squared error is also essential: under the coordinatewise absolute error $\sum_{j=1}^d|\hat\theta_j-\theta_j|$ the rate is of order $d\sigma$ rather than $d\sigma^2$. The bounded ball example gives the first instance of how a structural constraint changes this baseline.
[example: Bounded Gaussian Mean Upper Bound]
Let $X \sim \mathcal N(\theta,\sigma^2 I_d)$ and let $\Theta=\{\theta\in\mathbb R^d:|\theta|\le R\}$. We compare two estimators under squared error loss.
For the estimator $\hat\theta=X$, write $X=\theta+\sigma Z$ with $Z\sim\mathcal N(0,I_d)$. Then
\begin{align*}
R_{\hat\theta}(\theta)=\mathbb E_\theta[|X-\theta|^2]=\mathbb E[|\sigma Z|^2]=\sigma^2\mathbb E\left[\sum_{j=1}^d Z_j^2\right].
\end{align*}
By linearity of expectation and $\mathbb E[Z_j^2]=1$ for each standard normal coordinate,
\begin{align*}
\sigma^2\mathbb E\left[\sum_{j=1}^d Z_j^2\right]=\sigma^2\sum_{j=1}^d \mathbb E[Z_j^2]=\sigma^2\sum_{j=1}^d 1=d\sigma^2.
\end{align*}
This value does not depend on $\theta$, so
\begin{align*}
\sup_{\theta\in\Theta}R_{\hat\theta}(\theta)=d\sigma^2.
\end{align*}
For the constant estimator $\tilde\theta=0$, the estimator is non-random, hence
\begin{align*}
R_{\tilde\theta}(\theta)=\mathbb E_\theta[|0-\theta|^2]=|\theta|^2.
\end{align*}
Since $\theta\in\Theta$ implies $|\theta|\le R$,
\begin{align*}
\sup_{\theta\in\Theta}R_{\tilde\theta}(\theta)=\sup_{\theta\in\Theta}|\theta|^2\le R^2.
\end{align*}
The minimax risk is the infimum of the worst-case risk over all estimators, so it is no larger than the worst-case risk of either estimator. Therefore
\begin{align*}
\mathfrak M(\Theta,|\cdot|^2)\le \min\{d\sigma^2,R^2\}.
\end{align*}
This displays the two competing upper-bound scales: the sampling-noise scale $d\sigma^2$ from using the observation itself, and the radius scale $R^2$ from knowing that the whole parameter space lies in the ball.
[/example]
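The two risk computations in this example can be checked by simulation. The sketch below is a Monte Carlo illustration only; the dimension, noise level, radius, replication count, and seed are arbitrary choices made for the demonstration.

```python
import random

def mc_risk(estimator, theta, sigma, n_rep=20000, seed=0):
    """Monte Carlo estimate of E_theta |estimator(X) - theta|^2 for X ~ N(theta, sigma^2 I)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_rep):
        x = [t + sigma * rng.gauss(0.0, 1.0) for t in theta]
        est = estimator(x)
        total += sum((e - t) ** 2 for e, t in zip(est, theta))
    return total / n_rep

d, sigma, R = 5, 1.0, 1.5
theta = [R] + [0.0] * (d - 1)          # a point on the boundary of the ball

risk_identity = mc_risk(lambda x: x, theta, sigma)               # close to d * sigma^2
risk_zero = mc_risk(lambda x: [0.0] * len(x), theta, sigma)      # exactly |theta|^2
upper_bound = min(d * sigma**2, R**2)

print(risk_identity, risk_zero, upper_bound)
```

With these values $R^2 < d\sigma^2$, so the zero estimator attains the smaller of the two worst-case risks.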
## Packing, Covering, and Metric Entropy
How can a lower bound for estimation be proved without analysing every estimator directly? The standard method is to find many well-separated parameters whose induced distributions are hard to distinguish. Geometry supplies separation through packings, while probability supplies indistinguishability through testing bounds.
The first geometric object records how many hypotheses can be placed in the parameter space with a prescribed minimum separation.
[definition: Packing Number]
Let $(T,d)$ be a metric space and let $\varepsilon > 0$. A subset $\{t_1,\dots,t_M\} \subset T$ is an $\varepsilon$-packing if
\begin{align*}
d(t_i,t_j) > \varepsilon \quad \text{for all } i \ne j.
\end{align*}
For fixed $(T,d)$, the packing number is the map
\begin{align*}
M_{T,d}:(0,\infty)\to \mathbb N\cup\{\infty\}
\end{align*}
defined by letting $M_{T,d}(\varepsilon)$ be the largest cardinality of an $\varepsilon$-packing of $T$, with value $\infty$ if no finite largest cardinality exists. We write this value as $M(\varepsilon,T,d)$.
[/definition]
A large packing says that the parameter space contains many statistically distinct targets, provided the loss dominates the metric used to separate them. Coverings measure the dual question: how many balls are needed to approximate the whole space. This second notion is needed because entropy upper bounds are usually stated through covers.
[definition: Covering Number and Metric Entropy]
Let $(T,d)$ be a metric space and let $\varepsilon > 0$. A subset $\{t_1,\dots,t_N\} \subset T$ is an $\varepsilon$-cover if
\begin{align*}
T \subset \bigcup_{j=1}^N B(t_j,\varepsilon).
\end{align*}
For fixed $(T,d)$, the covering number is the map
\begin{align*}
N_{T,d}:(0,\infty)\to \mathbb N\cup\{\infty\}
\end{align*}
defined by letting $N_{T,d}(\varepsilon)$ be the smallest cardinality of an $\varepsilon$-cover of $T$, with value $\infty$ if no finite cover exists. We write this value as $N(\varepsilon,T,d)$. The metric entropy is the quantity $\log N(\varepsilon,T,d)$, with $\log\infty=\infty$.
[/definition]
Packing and covering numbers differ only by constant-scale changes in radius. This matters because lower-bound arguments naturally use packings while chaining, discretisation, and upper bounds often use covers. The following comparison lets us translate between the two conventions at the level of rates.
[quotetheorem:1095]
[citeproof:1095]
The strict inequality in the packing definition and the factor $2$ in the theorem are constant-scale artefacts, not substantive features of the theory. The factor cannot simply be removed in general: in the interval $[0,1]$, two points at distance just above $\varepsilon$ may lie in one closed ball of radius $\varepsilon$, so an $\varepsilon$-packing need not force an $\varepsilon$-cover to use distinct centres. Maximality is also necessary for the upper inequality; a non-maximal packing may leave entire regions of $T$ uncovered. Finally, the theorem uses the triangle inequality through the statement that one radius-$\varepsilon$ ball cannot contain two points separated by more than $2\varepsilon$; for a non-metric dissimilarity this argument can fail. This comparison says nothing about whether the packed points are statistically distinguishable, which is handled later by testing inequalities.
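The role of maximality can be seen on a finite example. The sketch below builds a packing greedily on an integer grid, which stands in for an interval; it assumes closed balls in the covering condition, and the names `greedy_packing` and `is_cover` are illustrative.

```python
def greedy_packing(points, eps, dist):
    """Greedily keep points whose distance to every kept point exceeds eps.

    The output is maximal within `points`: no further point can be added."""
    centres = []
    for p in points:
        if all(dist(p, c) > eps for c in centres):
            centres.append(p)
    return centres

def is_cover(centres, points, eps, dist):
    """Check that every point lies within distance eps of some centre (closed balls)."""
    return all(any(dist(p, c) <= eps for c in centres) for p in points)

grid = list(range(101))                 # integer stand-in for [0, 1] in units of 1/100
dist = lambda x, y: abs(x - y)
eps = 10

packing = greedy_packing(grid, eps, dist)
print(packing)                                   # a maximal eps-packing of the grid
print(is_cover(packing, grid, eps, dist))        # maximal packings are covers

thinned = packing[:5] + packing[6:]              # drop one centre: no longer maximal
print(is_cover(thinned, grid, eps, dist))
```

Removing a single centre leaves a valid packing that fails to cover the grid, which is the finite analogue of a non-maximal packing leaving a region uncovered.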
The comparison makes entropy estimates robust under the small changes in constants that are unavoidable in minimax theory. The next example supplies the entropy scale behind bounded Euclidean mean estimation, where volume is the right measure of complexity.
[example: Euclidean Ball Packing Scale]
Let $B_2^d(R)=\{\theta\in\mathbb R^d:|\theta|\le R\}$, fix $0<\varepsilon<R$, and write $v_d=\operatorname{Vol}(B_2^d(1))$. We show the standard volume bounds
\begin{align*}
\left(\frac{R}{\varepsilon}\right)^d
\le
N(\varepsilon,B_2^d(R),|\cdot|)
\le
M(\varepsilon,B_2^d(R),|\cdot|)
\le
\left(\frac{3R}{\varepsilon}\right)^d,
\end{align*}
which imply
\begin{align*}
d\log(R/\varepsilon)
\le
\log M(\varepsilon,B_2^d(R),|\cdot|)
\le
d\log(3R/\varepsilon).
\end{align*}
Thus, up to universal constant factors and for $\varepsilon\le R/2$, say, so that $\log(R/\varepsilon)$ is bounded away from zero,
\begin{align*}
\log M(\varepsilon,B_2^d(R),|\cdot|)\asymp d\log(R/\varepsilon).
\end{align*}
For the upper bound, let $\theta_1,\dots,\theta_M$ be an $\varepsilon$-packing of $B_2^d(R)$. The balls $B(\theta_i,\varepsilon/2)$ are pairwise disjoint: if some $x$ lay in both $B(\theta_i,\varepsilon/2)$ and $B(\theta_j,\varepsilon/2)$, then the triangle inequality would give
\begin{align*}
|\theta_i-\theta_j|
\le |\theta_i-x|+|x-\theta_j|
< \frac{\varepsilon}{2}+\frac{\varepsilon}{2}
=\varepsilon,
\end{align*}
contradicting the packing condition. Also, if $y\in B(\theta_i,\varepsilon/2)$, then
\begin{align*}
|y|\le |y-\theta_i|+|\theta_i|
\le \frac{\varepsilon}{2}+R
\le \frac{3R}{2},
\end{align*}
because $\varepsilon<R$. Hence
\begin{align*}
M v_d\left(\frac{\varepsilon}{2}\right)^d
\le
v_d\left(R+\frac{\varepsilon}{2}\right)^d
\le
v_d\left(\frac{3R}{2}\right)^d,
\end{align*}
and dividing by $v_d(\varepsilon/2)^d$ gives
\begin{align*}
M\le \left(\frac{3R}{\varepsilon}\right)^d.
\end{align*}
For the lower bound, let $x_1,\dots,x_N$ be any $\varepsilon$-cover of $B_2^d(R)$. Then
\begin{align*}
B_2^d(R)\subset \bigcup_{j=1}^N B(x_j,\varepsilon),
\end{align*}
so subadditivity of volume gives
\begin{align*}
v_dR^d
=
\operatorname{Vol}(B_2^d(R))
\le
\sum_{j=1}^N \operatorname{Vol}(B(x_j,\varepsilon))
=
N v_d\varepsilon^d.
\end{align*}
After cancelling $v_d\varepsilon^d$, this gives
\begin{align*}
N(\varepsilon,B_2^d(R),|\cdot|)
\ge
\left(\frac{R}{\varepsilon}\right)^d.
\end{align*}
Since every maximal $\varepsilon$-packing is an $\varepsilon$-cover, we also have
\begin{align*}
N(\varepsilon,B_2^d(R),|\cdot|)
\le
M(\varepsilon,B_2^d(R),|\cdot|).
\end{align*}
The entropy therefore grows linearly in the ambient dimension $d$ and logarithmically in the resolution ratio $R/\varepsilon$, reflecting that Euclidean balls have full $d$-dimensional volume complexity.
[/example]
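In dimension $d=1$ the ball is the interval $[-R,R]$, and both numbers can be computed by hand, which gives a quick sanity check of the volume bounds. The closed forms used below are a side calculation rather than part of the notes: they assume closed covering intervals and strict separation in the packing, and exact rational arithmetic is used to avoid rounding at the boundary cases.

```python
from fractions import Fraction
import math

def covering_number_interval(R, eps):
    """Fewest closed intervals of radius eps needed to cover [-R, R]."""
    return math.ceil(Fraction(R) / Fraction(eps))

def packing_number_interval(R, eps):
    """Most points in [-R, R] with all pairwise gaps strictly larger than eps."""
    return math.ceil(2 * Fraction(R) / Fraction(eps))

def entropy_bounds(d, R, eps):
    """The pair (d log(R/eps), d log(3R/eps)) bracketing log M in dimension d."""
    ratio = float(Fraction(R) / Fraction(eps))
    return d * math.log(ratio), d * math.log(3 * ratio)

R = Fraction(1)
for eps in [Fraction(1, 2), Fraction(1, 3), Fraction(2, 7), Fraction(1, 10)]:
    N = covering_number_interval(R, eps)
    M = packing_number_interval(R, eps)
    print(eps, R / eps, N, M, 3 * R / eps)   # R/eps <= N <= M <= 3R/eps
```

Each printed row should be nondecreasing from the second entry onward, matching the chain of inequalities in the example with $d=1$.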
The Euclidean ball example is governed by volume in the full ambient dimension. Sparse problems have a different source of complexity: the unknown support can be chosen in many ways even when each support has only $s$ coordinates. To state lower bounds for sparsity, we first need a parameter class that records this support constraint.
[definition: Sparse Vector Class]
For $1 \le s \le d$, the $s$-sparse subset of $\mathbb R^d$ is
\begin{align*}
\Theta_s = \{\theta \in \mathbb R^d : |\{j : \theta_j \ne 0\}| \le s\}.
\end{align*}
[/definition]
The sparse class is a union of many coordinate subspaces, so a lower bound needs a finite subset whose supports remain separated. Merely choosing points within one coordinate subspace would miss the combinatorial term $s\log(d/s)$. The Gilbert-Varshamov bound supplies the required collection of supports.
[quotetheorem:5893]
[citeproof:5893]
The restriction $s\le d/2$ ensures that the constant-weight layer has enough unused coordinates to create many separated supports. This hypothesis is not cosmetic: if $s=d$, the layer contains only the all-ones vector, so no exponential packing exists; if $s$ is close to $d$, the right combinatorial scale is better expressed through the number of zeros rather than $s\log(d/s)$. The fixed-weight condition is also important, because mixing many weights can create apparent cardinality without preserving the same sparsity and separation geometry. The conclusion is combinatorial rather than probabilistic: it supplies many alternatives, but it does not by itself prove that those alternatives are hard to distinguish from data.
This theorem is the combinatorial engine behind many sparse lower bounds. By multiplying each binary vector by an amplitude $a>0$, it becomes a Euclidean packing whose statistical separation can be tuned through $a$. The construction also records the Kullback-Leibler scale that will enter testing inequalities.
[example: Sparse Normal Means Packing]
Let $\mathcal V\subset\{0,1\}^d$ be a constant-weight Gilbert-Varshamov packing, so $|v|_0=s$ for every $v\in\mathcal V$ and $d_H(u,v)\ge s/2$ for distinct $u,v\in\mathcal V$, as in *Gilbert-Varshamov Bound*. Fix $a>0$ and define $\theta_v=av$ for $v\in\mathcal V$. Since
\begin{align*}
\{j:(\theta_v)_j\ne0\}=\{j:av_j\ne0\}=\{j:v_j=1\},
\end{align*}
each $\theta_v$ has exactly $s$ nonzero coordinates, and hence $\theta_v\in\Theta_s$.
For distinct $u,v\in\mathcal V$, expand the squared Euclidean distance coordinate by coordinate:
\begin{align*}
|\theta_u-\theta_v|^2=\sum_{j=1}^d (a u_j-a v_j)^2.
\end{align*}
Factoring out $a^2$ gives
\begin{align*}
\sum_{j=1}^d (a u_j-a v_j)^2=a^2\sum_{j=1}^d (u_j-v_j)^2.
\end{align*}
Because $u_j,v_j\in\{0,1\}$, the quantity $(u_j-v_j)^2$ equals $1$ exactly when $u_j\ne v_j$ and equals $0$ exactly when $u_j=v_j$. Therefore
\begin{align*}
a^2\sum_{j=1}^d (u_j-v_j)^2=a^2\sum_{j=1}^d \mathbb 1_{\{u_j\ne v_j\}}=a^2d_H(u,v).
\end{align*}
Using the Gilbert-Varshamov separation $d_H(u,v)\ge s/2$, we obtain
\begin{align*}
|\theta_u-\theta_v|^2\ge \frac{a^2s}{2}.
\end{align*}
In the Gaussian mean model $X\sim\mathcal N(\theta,\sigma^2I_d)$, the equal-covariance Gaussian KL formula gives
\begin{align*}
D_{\mathrm{KL}}\!\left(\mathcal N(\theta_u,\sigma^2I_d)\,\middle\|\,\mathcal N(\theta_v,\sigma^2I_d)\right)=\frac{1}{2}(\theta_u-\theta_v)^\top(\sigma^2I_d)^{-1}(\theta_u-\theta_v).
\end{align*}
Since $(\sigma^2I_d)^{-1}=\sigma^{-2}I_d$, this becomes
\begin{align*}
\frac{1}{2}(\theta_u-\theta_v)^\top(\sigma^2I_d)^{-1}(\theta_u-\theta_v)=\frac{1}{2\sigma^2}(\theta_u-\theta_v)^\top(\theta_u-\theta_v).
\end{align*}
Finally, $(\theta_u-\theta_v)^\top(\theta_u-\theta_v)=|\theta_u-\theta_v|^2$, so
\begin{align*}
D_{\mathrm{KL}}\!\left(\mathcal N(\theta_u,\sigma^2I_d)\,\middle\|\,\mathcal N(\theta_v,\sigma^2I_d)\right)=\frac{|\theta_u-\theta_v|^2}{2\sigma^2}.
\end{align*}
Thus the amplitude $a$ simultaneously controls the squared-error separation and the information distance: smaller $a$ makes the Gaussian hypotheses closer in KL divergence, while the packing still gives a squared-error gap at least $a^2s/2$.
[/example]
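The identities in this example can be verified on a small instance. The sketch below builds a constant-weight packing by a greedy search, which is a stand-in for the Gilbert-Varshamov construction and not the argument used in its proof; the values of $d$, $s$, $a$, and $\sigma$ are arbitrary.

```python
from itertools import combinations

def hamming(u, v):
    return sum(1 for a, b in zip(u, v) if a != b)

def greedy_constant_weight_packing(d, s, min_dist):
    """Greedy family of weight-s binary vectors with pairwise Hamming distance >= min_dist."""
    chosen = []
    for supp in combinations(range(d), s):
        members = set(supp)
        v = tuple(1 if j in members else 0 for j in range(d))
        if all(hamming(v, u) >= min_dist for u in chosen):
            chosen.append(v)
    return chosen

def squared_distance(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))

def gaussian_kl(theta_u, theta_v, sigma):
    """KL divergence between N(theta_u, sigma^2 I) and N(theta_v, sigma^2 I)."""
    return squared_distance(theta_u, theta_v) / (2 * sigma**2)

d, s, a, sigma = 12, 6, 0.5, 1.0
packing = greedy_constant_weight_packing(d, s, min_dist=s / 2)
thetas = {v: tuple(a * vj for vj in v) for v in packing}   # theta_v = a * v

u, v = packing[0], packing[1]
print(len(packing), hamming(u, v))
print(squared_distance(thetas[u], thetas[v]), a**2 * hamming(u, v))
print(gaussian_kl(thetas[u], thetas[v], sigma))
```

Shrinking `a` lowers every pairwise KL divergence quadratically while leaving the combinatorial structure of the packing untouched.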
## From Estimation to Testing
Why do packings imply lower bounds for estimation? If an estimator has small loss uniformly, then on a finite well-separated subset it can be converted into a test that identifies which parameter generated the data. Therefore any lower bound on testing error becomes a lower bound on estimation error.
The two-point version is the simplest and often gives the correct dependence on signal strength. It reduces estimation to testing between two distributions.
[quotetheorem:5894]
[citeproof:5894]
For squared error, if $|\theta_0-\theta_1| \ge 2r$, then the triangle inequality gives $|a-\theta_0|+|a-\theta_1|\ge 2r$, and the elementary bound $x^2+y^2\ge (x+y)^2/2$ upgrades this to $|a-\theta_0|^2+|a-\theta_1|^2 \ge 2r^2$. The loss-separation assumption is necessary: if the same action has zero loss at both $\theta_0$ and $\theta_1$, no lower bound on estimation follows from being unable to distinguish them. The total-variation factor records the opposite limitation. When $\|\mathbb P_{\theta_0}-\mathbb P_{\theta_1}\|_{\mathrm{TV}}$ is close to $1$, a test can almost identify the true point, so the two-point argument gives little information even if the parameters are far apart. The next example performs the required closeness calculation in the Gaussian mean model.
[example: Two Point Gaussian Mean Lower Bound]
Let $h>0$, set $\theta_0=0$ and $\theta_1=he_1$, and use squared error loss. For any action $a=(a_1,\dots,a_d)\in\mathbb R^d$,
\begin{align*}
|a-\theta_0|^2+|a-\theta_1|^2=|a|^2+|a-he_1|^2.
\end{align*}
Expanding the two squared norms coordinate by coordinate gives
\begin{align*}
|a|^2+|a-he_1|^2=\sum_{j=1}^d a_j^2+(a_1-h)^2+\sum_{j=2}^d a_j^2.
\end{align*}
Collecting the first coordinate and the remaining coordinates,
\begin{align*}
\sum_{j=1}^d a_j^2+(a_1-h)^2+\sum_{j=2}^d a_j^2=a_1^2+(a_1-h)^2+2\sum_{j=2}^d a_j^2.
\end{align*}
Completing the square in $a_1$,
\begin{align*}
a_1^2+(a_1-h)^2+2\sum_{j=2}^d a_j^2=2\left(a_1-\frac h2\right)^2+\frac{h^2}{2}+2\sum_{j=2}^d a_j^2.
\end{align*}
Since the squared terms are nonnegative,
\begin{align*}
|a-\theta_0|^2+|a-\theta_1|^2\ge \frac{h^2}{2}.
\end{align*}
Thus the loss-separation parameter in *Le Cam Two Point Method* is $\Delta=h^2/2$.
For the total-variation term, *Pinsker's inequality* gives
\begin{align*}
\left\|\mathcal N(0,\sigma^2I_d)-\mathcal N(he_1,\sigma^2I_d)\right\|_{\mathrm{TV}}\le \sqrt{\frac12 D_{\mathrm{KL}}\!\left(\mathcal N(0,\sigma^2I_d)\,\middle\|\,\mathcal N(he_1,\sigma^2I_d)\right)}.
\end{align*}
The equal-covariance Gaussian divergence formula gives
\begin{align*}
D_{\mathrm{KL}}\!\left(\mathcal N(0,\sigma^2I_d)\,\middle\|\,\mathcal N(he_1,\sigma^2I_d)\right)=\frac12(0-he_1)^\top(\sigma^2I_d)^{-1}(0-he_1).
\end{align*}
Because $(\sigma^2I_d)^{-1}=\sigma^{-2}I_d$,
\begin{align*}
\frac12(0-he_1)^\top(\sigma^2I_d)^{-1}(0-he_1)=\frac{1}{2\sigma^2}(-he_1)^\top(-he_1).
\end{align*}
Since $e_1^\top e_1=1$,
\begin{align*}
\frac{1}{2\sigma^2}(-he_1)^\top(-he_1)=\frac{h^2}{2\sigma^2}.
\end{align*}
Substituting this value into Pinsker's inequality gives
\begin{align*}
\left\|\mathcal N(0,\sigma^2I_d)-\mathcal N(he_1,\sigma^2I_d)\right\|_{\mathrm{TV}}\le \sqrt{\frac12\cdot\frac{h^2}{2\sigma^2}}=\frac{h}{2\sigma}.
\end{align*}
Applying *Le Cam Two Point Method* therefore yields
\begin{align*}
\mathfrak M(\Theta,|\cdot|^2)\ge \frac{h^2/2}{4}\left(1-\frac{h}{2\sigma}\right)=\frac{h^2}{8}\left(1-\frac{h}{2\sigma}\right).
\end{align*}
For example, taking $h=\sigma$ gives
\begin{align*}
\mathfrak M(\Theta,|\cdot|^2)\ge \frac{\sigma^2}{8}\left(1-\frac12\right)=\frac{\sigma^2}{16}.
\end{align*}
Thus any parameter space containing two points separated by one noise standard deviation in a coordinate direction already has squared-error minimax risk bounded below by a universal constant times $\sigma^2$.
[/example]
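The dependence of this bound on the separation $h$ is easy to explore numerically. The sketch below evaluates the displayed lower bound, compares the Pinsker estimate with the exact total variation between two Gaussian shifts (the closed form $2\Phi(h/(2\sigma))-1$ is standard but is not derived in these notes), and searches a grid for the best $h$. Elementary calculus on $h^2/8-h^3/(16\sigma)$ suggests the maximiser $h=4\sigma/3$ with value $2\sigma^2/27$; the grid and its resolution are arbitrary.

```python
import math

def pinsker_tv_bound(h, sigma):
    """Pinsker bound sqrt(KL / 2) with KL = h^2 / (2 sigma^2); equals h / (2 sigma)."""
    kl = h**2 / (2 * sigma**2)
    return math.sqrt(kl / 2)

def exact_tv(h, sigma):
    """Exact TV between N(0, sigma^2 I) and N(h e_1, sigma^2 I): 2 Phi(h / (2 sigma)) - 1."""
    return math.erf(h / (2 * math.sqrt(2) * sigma))

def le_cam_bound(h, sigma):
    """The lower bound (h^2 / 8) * (1 - h / (2 sigma)) from the example."""
    return (h**2 / 8) * (1 - pinsker_tv_bound(h, sigma))

sigma = 1.0
print(le_cam_bound(sigma, sigma))                         # the choice h = sigma
print(exact_tv(sigma, sigma), pinsker_tv_bound(sigma, sigma))

grid = [k / 1000 * sigma for k in range(1, 2001)]         # h in (0, 2 sigma]
best_h = max(grid, key=lambda h: le_cam_bound(h, sigma))
print(best_h, le_cam_bound(best_h, sigma))
```

The improvement from tuning $h$ affects only the constant; the bound remains of order $\sigma^2$ and carries no dependence on $d$.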
Two points cannot capture the full dimension dependence in many high-dimensional problems. The Gaussian example only produced a one-coordinate obstruction, while sparse and dense problems contain many possible alternatives. Multiple testing reductions exploit a whole packing and produce lower bounds involving the logarithm of the packing size.
[quotetheorem:5895]
[citeproof:5895]
The decoder hypothesis is essential because an arbitrary small-loss action need not name a packing point unless the geometry supplies such a decoding rule. For a concrete failure, take two distinct parameters but use the zero loss $L(\theta,a)=0$ for every action $a$; the testing problem may be hard, yet the minimax estimation risk is zero and no decoder inequality with positive $\delta$ can hold. More subtly, a loss may only estimate a common functional of several packing points, in which case small loss does not determine which point generated the data. In squared-error problems nearest-neighbour decoding usually provides the missing map, while in model-selection problems the decoder may be the estimated support itself. Compared with Le Cam's method, this reduction can retain the logarithm of the number of alternatives, which is where dimension and sparsity enter. Its probabilistic content is still missing until a separate result, such as Fano's inequality, lower bounds the testing error.
This reduction is deliberately abstract: all geometry is hidden in the decoder and the separation parameter $\delta$. In practice, nearest-neighbour decoding on a packing supplies the required map, because choosing the wrong packing point forces the estimator to be far from the truth.
[remark: Nearest Neighbour Decoder]
If $d(\theta_i,\theta_j)>2r$ for all $i\ne j$ and $L(\theta,a)=d(\theta,a)^2$, define $\psi(a)$ to be an index minimising $d(a,\theta_i)$. Whenever $\psi(a)\ne i$, the triangle inequality gives $d(a,\theta_i)>r$, so the loss is at least $r^2$.
[/remark]
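The decoder in this remark is a one-line function once a metric is fixed. The sketch below uses the Euclidean metric in the plane; the three centres, the value of `r`, and the random actions are illustrative choices satisfying the separation condition $d(\theta_i,\theta_j)>2r$.

```python
import math
import random

def nearest_neighbour_decoder(a, centres):
    """psi(a): index of a centre closest to the action a (ties broken by lowest index)."""
    return min(range(len(centres)), key=lambda i: math.dist(a, centres[i]))

centres = [(0.0, 0.0), (3.0, 0.0), (0.0, 3.0)]   # pairwise distances at least 3
r = 1.4                                          # so every pair is more than 2r apart

rng = random.Random(1)
violations = 0
for _ in range(1000):
    a = (rng.uniform(-2.0, 5.0), rng.uniform(-2.0, 5.0))
    psi = nearest_neighbour_decoder(a, centres)
    for i, theta_i in enumerate(centres):
        # If the decoder does not output i, the action must be farther than r from theta_i.
        if i != psi and math.dist(a, theta_i) <= r:
            violations += 1

print(violations)
```

No violation can occur, because the triangle inequality argument in the remark applies to every action, not only to the sampled ones.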
Chapter 2 pairs this reduction with Fano's inequality, which lower bounds the multiple-testing error in terms of average information between hypotheses. The present chapter isolates the deterministic estimation-to-testing step.
## Benchmark Minimax Examples
What rates should the general theory recover in familiar high-dimensional models? Three examples serve as reference points: bounded Gaussian means, sparse normal means, and covariance estimation. They show how dimension, entropy, and matrix norms enter minimax risk.
We first complete the bounded Gaussian mean calculation up to constants. The upper bound was obtained by comparing the sample estimator and the zero estimator; the lower bound combines a small Euclidean packing with Gaussian testing.
[quotetheorem:5896]
[citeproof:5896]
The ball constraint matters only through the smaller of its squared-radius scale $R^2$ and the ambient Gaussian noise scale $d\sigma^2$. Removing the ball changes the answer back to $d\sigma^2$, while replacing the Euclidean ball by a sparse or anisotropic set changes the effective dimension. The Gaussian and squared-error assumptions also matter for the exact form of the benchmark: heavy-tailed noise can make the sample mean fail without robustification, and non-quadratic losses scale differently with $\sigma$. A two-point argument is enough only when the separated points are also statistically close; for larger radii below the dense-noise scale, a product prior is needed to distribute the difficulty across many weak coordinates. Sparse models will replace this dense-coordinate balance by a support-counting balance.
The bounded mean model is dense: all coordinates may contribute. Sparse normal means introduce a different obstruction: the estimator must learn not only the amplitudes but also which support among many possible supports is active. The binary packing constructed above isolates this combinatorial uncertainty, and the lower bound asks how much squared-error risk is forced by the inability to distinguish those supports in Gaussian noise.
[quotetheorem:5897]
[citeproof:5897]
The condition $s\le d/2$ is used through the Gilbert-Varshamov packing; when sparsity is no longer genuine, the dense Gaussian mean rate is the more natural benchmark. The Gaussian variance and squared-error loss determine the scale $\sigma^2$ multiplying the combinatorial term; for exact support recovery the relevant loss and threshold are different. The theorem is a lower bound only, because an upper bound requires a concrete sparse estimator such as thresholding or penalised least squares. The proof also explains the logarithmic factor: it is not a noise variance effect, but the price of not knowing which support among roughly $\exp(s\log(d/s))$ candidates is active.
This theorem previews a pattern that will recur: a lower bound is obtained by balancing separation in loss against information divergence. The same packing can also be used for support recovery if the amplitudes are fixed and the loss is whether the support is identified exactly. The following heuristic states the resulting signal-strength threshold.
[example: Support Recovery Threshold Heuristic]
In the sparse normal means model, suppose $\theta_j\in\{0,a\}$ and exactly $s$ coordinates are nonzero. Let $\mathcal V\subset\{0,1\}^d$ be the constant-weight packing from *Gilbert-Varshamov Bound*, so $|v|_0=s$ for every $v\in\mathcal V$, the supports are separated by
\begin{align*}
d_H(u,v)\ge \frac{s}{2}\quad\text{for distinct }u,v\in\mathcal V,
\end{align*}
and the number of alternatives satisfies
\begin{align*}
\log|\mathcal V|\ge c s\log(d/s).
\end{align*}
For each $v\in\mathcal V$, set $\theta_v=av$. Since $a>0$, the support of $\theta_v$ is exactly $\{j:v_j=1\}$, so exact support recovery over this finite family is the same as identifying which $v$ generated the data.
For distinct $u,v\in\mathcal V$, the squared Euclidean separation is obtained coordinate by coordinate:
\begin{align*}
|\theta_u-\theta_v|^2=\sum_{j=1}^d (a u_j-a v_j)^2.
\end{align*}
Factoring out $a^2$ gives
\begin{align*}
\sum_{j=1}^d (a u_j-a v_j)^2=a^2\sum_{j=1}^d (u_j-v_j)^2.
\end{align*}
Because $u_j,v_j\in\{0,1\}$, we have $(u_j-v_j)^2=\mathbb 1_{\{u_j\ne v_j\}}$, and hence
\begin{align*}
a^2\sum_{j=1}^d (u_j-v_j)^2=a^2 d_H(u,v).
\end{align*}
Using the packing separation,
\begin{align*}
|\theta_u-\theta_v|^2\ge \frac{a^2s}{2}.
\end{align*}
The Gaussian information scale has the same support-counting factor. Comparing $\theta_v$ with the zero mean in the Gaussian mean model,
\begin{align*}
D_{\mathrm{KL}}\!\left(\mathcal N(\theta_v,\sigma^2I_d)\,\middle\|\,\mathcal N(0,\sigma^2I_d)\right)=\frac{|\theta_v|^2}{2\sigma^2}.
\end{align*}
Since $v$ has exactly $s$ nonzero coordinates,
\begin{align*}
|\theta_v|^2=\sum_{j=1}^d a^2v_j^2=a^2|v|_0=a^2s.
\end{align*}
Therefore
\begin{align*}
D_{\mathrm{KL}}\!\left(\mathcal N(\theta_v,\sigma^2I_d)\,\middle\|\,\mathcal N(0,\sigma^2I_d)\right)=\frac{a^2s}{2\sigma^2}.
\end{align*}
Thus the information per packed alternative is proportional to $a^2s/\sigma^2$, while the packing entropy is at least $c s\log(d/s)$. If
\begin{align*}
\frac{a^2s}{2\sigma^2}\le \eta c s\log(d/s)
\end{align*}
for a sufficiently small universal constant $\eta>0$, then cancelling $s$ gives
\begin{align*}
\frac{a^2}{\sigma^2}\le 2\eta c\log(d/s).
\end{align*}
In this regime the information is only a small constant fraction of the logarithm of the number of possible supports, so the usual *Fano inequality* testing calculation prevents reliable identification of the support on this packing. The packing obstruction therefore occurs below the scale
\begin{align*}
\frac{a^2}{\sigma^2}\asymp \log(d/s).
\end{align*}
Sharper exact-recovery thresholds may also depend on coordinatewise extreme events and on the sparsity regime, so this calculation identifies the minimax packing scale rather than a full sharp threshold theorem.
[/example]
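The comparison between information and entropy in this heuristic can be written as a short calculation. In the sketch below the constants `eta` and `c` are placeholders: the notes only assert that suitable universal constants exist, so the numerical values are assumptions made purely for illustration.

```python
import math

def kl_to_zero(a, s, sigma):
    """KL divergence a^2 s / (2 sigma^2) between N(theta_v, sigma^2 I) and N(0, sigma^2 I)."""
    return a**2 * s / (2 * sigma**2)

def packing_entropy_lower_bound(d, s, c):
    """Lower bound c * s * log(d / s) on the log-cardinality of the packing."""
    return c * s * math.log(d / s)

def below_packing_scale(a, d, s, sigma, eta, c):
    """True when the information is at most eta times the entropy lower bound."""
    return kl_to_zero(a, s, sigma) <= eta * packing_entropy_lower_bound(d, s, c)

def threshold_amplitude(d, s, sigma, eta, c):
    """Amplitude at which a^2 / sigma^2 = 2 * eta * c * log(d / s)."""
    return sigma * math.sqrt(2 * eta * c * math.log(d / s))

d, s, sigma = 1000, 10, 1.0
eta, c = 0.1, 0.5                      # placeholder constants, not values from the notes
a_star = threshold_amplitude(d, s, sigma, eta, c)

print(a_star)
print(below_packing_scale(0.9 * a_star, d, s, sigma, eta, c))
print(below_packing_scale(1.1 * a_star, d, s, sigma, eta, c))
```

Only the scaling $a\asymp\sigma\sqrt{\log(d/s)}$ is meaningful here; the printed amplitude depends on the placeholder constants.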
So far the parameter has been a vector. Random matrix theory enters when the unknown parameter is a covariance matrix and the data consist of many independent random vectors. The next definition fixes the Gaussian covariance model that will be revisited when sample covariance spectra are analysed.
[definition: Covariance Estimation Model]
Let $X_1,\dots,X_n$ be i.i.d. random vectors in $\mathbb R^d$ with $X_i \sim \mathcal N(0,\Sigma)$. The parameter space is a class $\Theta$ of symmetric positive semidefinite matrices $\Sigma \in \mathbb R^{d\times d}$. Let $\mathbb S^d$ denote the measurable space of real symmetric $d\times d$ matrices with its Borel $\sigma$-algebra.
[/definition]
The definition specifies the data-generating family and the matrix-valued parameter space. The most basic estimator for this model is the sample covariance
\begin{align*}
\hat\Sigma(x_1,\dots,x_n)=\frac{1}{n}\sum_{i=1}^n x_ix_i^\top,
\end{align*}
which is a measurable map from $(\mathbb R^d)^n$ to $\mathbb S^d$. Its performance will be compared with minimax lower bounds rather than assumed to be optimal from the definition alone.
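As a concrete illustration, the sample covariance can be computed with plain lists. The sketch below follows the displayed formula, which does not subtract a sample mean because the model has mean zero; the three data points are arbitrary.

```python
def sample_covariance(xs):
    """(1/n) * sum_i x_i x_i^T for a list of n vectors in R^d (no centring)."""
    n, d = len(xs), len(xs[0])
    return [[sum(x[i] * x[j] for x in xs) / n for j in range(d)] for i in range(d)]

def quadratic_form(S, v):
    """v^T S v, which equals (1/n) * sum_i (x_i . v)^2 when S is the sample covariance."""
    d = len(v)
    return sum(v[i] * S[i][j] * v[j] for i in range(d) for j in range(d))

xs = [(1.0, 0.0), (0.0, 2.0), (1.0, 2.0)]
S = sample_covariance(xs)
print(S)
print(quadratic_form(S, (1.0, -1.0)))   # nonnegative, as for any direction
```

The output is symmetric with nonnegative quadratic forms, so the estimator takes values in the positive semidefinite part of $\mathbb S^d$.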
For matrices, the norm determines which aspect of the covariance is being estimated. Entrywise aggregate accuracy and worst-direction accuracy lead to different rates, so we introduce both losses before stating the benchmark theorem.
[definition: Frobenius and Operator Norm Losses]
Let $\mathbb S^d$ denote the set of real symmetric $d\times d$ matrices. The Frobenius loss is the map
\begin{align*}
L_F: \mathbb S^d\times\mathbb S^d &\to [0,\infty), & L_F(A,B) &= \|A-B\|_F^2 = \sum_{i,j=1}^d (A_{ij}-B_{ij})^2.
\end{align*}
The operator norm loss is the map
\begin{align*}
L_{\mathrm{op}}: \mathbb S^d\times\mathbb S^d &\to [0,\infty), & L_{\mathrm{op}}(A,B) &= \|A-B\|_{\mathrm{op}}^2.
\end{align*}
[/definition]
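The two matrix losses can be compared numerically on small examples. The sketch below computes the operator norm of a symmetric matrix by power iteration started from each standard basis vector; this is a simple illustrative method that may converge slowly when the two largest eigenvalue magnitudes are close, and it is not a technique used elsewhere in these notes.

```python
import math

def frobenius_loss(A, B):
    """L_F(A, B): sum of squared entrywise differences."""
    return sum((a - b) ** 2 for ra, rb in zip(A, B) for a, b in zip(ra, rb))

def operator_norm(D, n_iter=200):
    """Largest |eigenvalue| of a symmetric matrix D, via power iteration from each basis vector."""
    d = len(D)
    best = 0.0
    for start in range(d):
        v = [1.0 if j == start else 0.0 for j in range(d)]
        norm = 0.0
        for _ in range(n_iter):
            w = [sum(D[i][j] * v[j] for j in range(d)) for i in range(d)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm == 0.0:          # this start lies in the kernel of D
                break
            v = [x / norm for x in w]
        best = max(best, norm)
    return best

def operator_norm_loss(A, B):
    """L_op(A, B): squared operator norm of A - B."""
    D = [[a - b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]
    return operator_norm(D) ** 2

# Many equal small errors: Frobenius loss adds them up, operator loss does not.
A = [[0.1, 0.0, 0.0], [0.0, 0.1, 0.0], [0.0, 0.0, 0.1]]
B = [[0.0] * 3 for _ in range(3)]
print(frobenius_loss(A, B), operator_norm_loss(A, B))
```

For a difference equal to $\delta I_d$ the Frobenius loss is $d\delta^2$ while the operator norm loss is $\delta^2$, which is the simplest instance of the two losses seeing different aspects of the same error.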
These two losses make different parts of the covariance matrix statistically visible. Frobenius loss charges the accumulation of many small entrywise errors, whereas operator norm loss charges the largest directional distortion. The next benchmark theorem is needed to anchor the later random-matrix analysis of sample covariance matrices.
[quotetheorem:5898]
[citeproof:5898]
The positive-semidefinite constraint and the upper bound $\Sigma\preceq I_d$ are essential for the displayed caps; without a bounded covariance class, rescaling $\Sigma$ would make the risk arbitrarily large. For example, if a class contains $t\Sigma$ for all $t>0$ and some nonzero covariance matrix $\Sigma$, then squared Frobenius and squared operator-norm errors both scale like $t^2$ under the corresponding rescaling, so no finite cap can hold uniformly. Gaussianity is also used through the exact covariance divergence and the sample covariance concentration inequality; for heavy-tailed data, the same estimator may need truncation or robustification. Frobenius loss counts many matrix entries and therefore saturates at the diameter scale $d$, while operator norm loss measures only the worst direction and has a different dependence on $d/n$. The proof sketch separates inputs that will be developed later: entrywise variance calculations, Gaussian covariance divergence, matrix packings, and sample covariance concentration. This is why the theorem is a benchmark rather than the final word on covariance estimation.
This covariance example is included as a signpost for the random matrix half of the course. Later chapters will prove the concentration inputs and study the spectral behaviour of $\hat\Sigma$ in regimes where $d/n$ does not vanish.
[remark: Rates Versus Constants]
Throughout the course, minimax notation such as $\asymp$ suppresses universal constants and sometimes constants depending on fixed distributional parameters. The main objects of study are the dependence on $n$, $d$, sparsity $s$, rank, signal strength, and noise level.
[/remark]
Chapter 1 has established the basic minimax framework and the meaning of asymptotic rates in high dimensions. With that language in place, we now move to the main information-theoretic device for proving impossibility results: Fano-type lower bounds based on testing and metric entropy.
# 2. Fano's Inequality and Metric Entropy Lower Bounds
This chapter develops the main information-theoretic lower-bound method used throughout high-dimensional minimax theory. It assumes the Chapter 1 reduction from estimation to testing, the basic language of probability measures and likelihood ratios introduced in Chapters 0 and 1, and the metric notions of separated sets and packings from Chapter 1. Here we learn how to control the error probability of many-way tests through mutual information and Kullback-Leibler divergence. The central message is that a large packing produces a strong lower bound when the observations do not contain enough information to identify which packing point generated the data.
## Multiple Testing as the Core Obstruction
The recurring question is how to turn a hard estimation problem into a hard identification problem. If a parameter space contains many well-separated points, then any estimator with small loss would identify the true point with high probability. Fano's inequality turns this implication around: if the data cannot reliably identify a random index, then the minimax risk cannot be small.
[definition: Kullback-Leibler Divergence]
Let $(\mathcal X,\mathcal A)$ be a measurable space. The Kullback-Leibler divergence is the map
\begin{align*}
D:\{(P,Q):P,Q\text{ probability measures on }(\mathcal X,\mathcal A)\}\to [0,\infty],
\qquad (P,Q)\mapsto D(P\|Q)
\end{align*}
defined as follows. If $P \ll Q$, then
\begin{align*}
D(P \| Q) = \int_{\mathcal X} \log\left(\frac{dP}{dQ}\right)\,dP.
\end{align*}
If $P \not\ll Q$, set $D(P \| Q)=\infty$.
[/definition]
The divergence is not a metric, but it measures how expensive it is to mistake $Q$ for $P$ in likelihood-ratio calculations. In product models it is especially useful because independent samples add KL divergences, so a sample size factor often appears automatically.
[example: KL Divergence Between Gaussian Shifts]
Let $P_\theta=\mathcal N(\theta,\sigma^2 I_d)$ and $P_{\theta'}=\mathcal N(\theta',\sigma^2 I_d)$ on $\mathbb R^d$. Their Lebesgue densities are $p_\theta(x)=(2\pi\sigma^2)^{-d/2}\exp(-|x-\theta|^2/(2\sigma^2))$ and $p_{\theta'}(x)=(2\pi\sigma^2)^{-d/2}\exp(-|x-\theta'|^2/(2\sigma^2))$, so
\begin{align*}
\log\frac{p_\theta(x)}{p_{\theta'}(x)}=\frac{|x-\theta'|^2-|x-\theta|^2}{2\sigma^2}.
\end{align*}
The numerator expands as
\begin{align*}
|x-\theta'|^2=|x|^2-2x\cdot\theta'+|\theta'|^2
\end{align*}
and
\begin{align*}
|x-\theta|^2=|x|^2-2x\cdot\theta+|\theta|^2.
\end{align*}
Subtracting these two identities gives
\begin{align*}
|x-\theta'|^2-|x-\theta|^2=2x\cdot(\theta-\theta')+|\theta'|^2-|\theta|^2.
\end{align*}
By the definition of the KL divergence for laws with densities,
\begin{align*}
D(P_\theta\|P_{\theta'})=\mathbb E_\theta\left[\log\frac{p_\theta(X)}{p_{\theta'}(X)}\right].
\end{align*}
Substituting the expanded log-likelihood ratio yields
\begin{align*}
D(P_\theta\|P_{\theta'})=\frac{2\mathbb E_\theta[X]\cdot(\theta-\theta')+|\theta'|^2-|\theta|^2}{2\sigma^2}.
\end{align*}
Since $\mathbb E_\theta[X]=\theta$, the numerator is
\begin{align*}
2\theta\cdot(\theta-\theta')+|\theta'|^2-|\theta|^2=|\theta|^2-2\theta\cdot\theta'+|\theta'|^2.
\end{align*}
Finally,
\begin{align*}
|\theta|^2-2\theta\cdot\theta'+|\theta'|^2=|\theta-\theta'|^2,
\end{align*}
and hence
\begin{align*}
D(P_\theta\|P_{\theta'})=\frac{|\theta-\theta'|^2}{2\sigma^2}.
\end{align*}
For independent observations $X_1,\dots,X_n$, the product density under $\theta$ is $\prod_{i=1}^n p_\theta(X_i)$. Thus the product log-likelihood ratio is the sum of the one-observation log-likelihood ratios:
\begin{align*}
\log\frac{\prod_{i=1}^n p_\theta(X_i)}{\prod_{i=1}^n p_{\theta'}(X_i)}=\sum_{i=1}^n \log\frac{p_\theta(X_i)}{p_{\theta'}(X_i)}.
\end{align*}
Taking expectation under $P_\theta^{\otimes n}$ and using that each $X_i$ has law $P_\theta$ gives
\begin{align*}
D(P_\theta^{\otimes n}\|P_{\theta'}^{\otimes n})=\sum_{i=1}^n D(P_\theta\|P_{\theta'}).
\end{align*}
Therefore
\begin{align*}
D(P_\theta^{\otimes n}\|P_{\theta'}^{\otimes n})=\frac{n|\theta-\theta'|^2}{2\sigma^2}.
\end{align*}
Thus Euclidean separation increases the KL cost quadratically, while independent samples multiply the available information by $n$.
[/example]
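The closed form can be sanity-checked numerically. The sketch below assumes a one-dimensional midpoint-rule integration of the log-likelihood ratio against the density $p_\theta$; the function names and the chosen parameter values are illustrative only.

```python
import math

def gaussian_shift_kl(theta, theta_prime, sigma, n=1):
    """Closed form n * |theta - theta'|^2 / (2 sigma^2) for N(theta, sigma^2 I_d)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(theta, theta_prime))
    return n * sq_dist / (2.0 * sigma ** 2)

def numeric_kl_1d(theta, theta_prime, sigma, half_width=12.0, steps=20_000):
    """Midpoint-rule value of E_theta[log p_theta(X) / p_theta'(X)] in dimension one."""
    lo = theta - half_width * sigma
    h = 2.0 * half_width * sigma / steps
    total = 0.0
    for i in range(steps):
        x = lo + (i + 0.5) * h
        density = math.exp(-(x - theta) ** 2 / (2 * sigma ** 2)) / math.sqrt(2 * math.pi * sigma ** 2)
        log_ratio = ((x - theta_prime) ** 2 - (x - theta) ** 2) / (2 * sigma ** 2)
        total += density * log_ratio * h
    return total

closed = gaussian_shift_kl([0.3], [1.1], sigma=0.7)
approx = numeric_kl_1d(0.3, 1.1, sigma=0.7)
with_samples = gaussian_shift_kl([0.3], [1.1], sigma=0.7, n=25)  # n independent copies
print(closed, approx, with_samples)
```

The numerical integral agrees with the closed form up to discretisation error, and the product-model value is exactly $n$ times the single-observation value.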
The Gaussian calculation shows how to control pairwise distinguishability once two parameters are fixed. A many-way lower bound also needs a quantity describing how much the entire observation reveals about the unknown index, which leads to mutual information.
[definition: Mutual Information]
For random variables $V$ and $X$ on measurable spaces, with joint distribution $P_{V,X}$ and marginals $P_V$, $P_X$, mutual information is the functional
\begin{align*}
I:\{(V,X):P_{V,X}\text{ is a joint law of two random variables}\}\to [0,\infty],
\qquad (V,X)\mapsto I(V;X)
\end{align*}
given by
\begin{align*}
I(V;X)=D(P_{V,X}\|P_V\otimes P_X).
\end{align*}
[/definition]
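For a finite joint table the definition reduces to a double sum. The following sketch assumes the joint law is supplied as a nested list `joint[v][x]`; the helper name is hypothetical. It checks the two extreme cases used later in the chapter: $X=V$ gives $I(V;X)=\log M$, and independence gives $I(V;X)=0$.

```python
import math

def mutual_information(joint):
    """I(V;X) in nats for a finite table joint[v][x] = P(V = v, X = x)."""
    p_v = [sum(row) for row in joint]
    p_x = [sum(col) for col in zip(*joint)]
    total = 0.0
    for v, row in enumerate(joint):
        for x, p in enumerate(row):
            if p > 0:  # cells with zero joint mass contribute nothing
                total += p * math.log(p / (p_v[v] * p_x[x]))
    return total

M = 4
identity = [[1.0 / M if v == x else 0.0 for x in range(M)] for v in range(M)]  # X = V
independent = [[1.0 / (M * M)] * M for _ in range(M)]                          # X independent of V
print(mutual_information(identity), math.log(M))
print(mutual_information(independent))
```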
Mutual information measures how much the observation reduces uncertainty about the index. The next question is quantitative: if $V$ is uniform on many possibilities and $I(V;X)$ is small compared with $\log M$, how often must every decoder fail?
[quotetheorem:1654]
[citeproof:1654]
Fano's inequality is a statement about uncertainty in a uniformly chosen index. Each hypothesis has a specific role. If $V$ is not spread out, the conclusion can fail in the intended form: for example, if $\mathbb P(V=1)=1$, the decoder $\hat V\equiv 1$ has zero error even when $X$ is ignored, so no lower bound proportional to a large alphabet entropy can be true. If $M$ is small, the bound can be numerically empty: for $M=2$ the extra $\log 2$ term already cancels the denominator, so binary testing requires the two-point method from Chapter 1 rather than Fano. The theorem also does not say that every large alphabet is hard; if $X=V$, then $I(V;X)=\log M$ and exact decoding is possible, so the information upper bound is the decisive input.
Thus Fano reduces the lower-bound task to an upper bound on $I(V;X)$. Direct computation of mutual information can be awkward because the marginal law of $X$ is a mixture, so the next lemma gives a computable upper bound in terms of KL divergences to a reference law or, when convenient, in terms of pairwise KL divergences between the candidate laws.
[quotetheorem:5899]
[citeproof:5899]
The average KL bound explains why Fano is a many-hypothesis tool: it rewards large entropy in the index set. In the next estimation reduction, this lemma is the step that replaces the mutual information term in Fano by a computable average over the chosen packing points. Thus a statistical lower bound will have two independent inputs: metric separation supplies the loss scale, and this KL average supplies the information budget. The binary endpoint behaves differently, so it is useful to separate it from the multiway regime before using Fano in estimation.
The reference law $Q$ is not required to be a member of the model, which is useful when a central mixture or a convenient null has smaller average divergence. The hypothesis $D(P_j\|Q)<\infty$ is not cosmetic: if $P_1$ is a point mass at $0$, $P_2$ is a point mass at $1$, and $Q=P_1$, then $D(P_2\|Q)=\infty$, so the bound gives no usable information even though the testing problem is perfectly solvable. The theorem also does not prove that a family is hard; it only provides an upper bound on information once a suitable reference or pairwise KL control has been found. This limitation is why lower-bound constructions usually keep the alternatives in a local likelihood neighbourhood rather than placing them arbitrarily far apart.
[example: Binary Fano Reduces to a Testing Bound]
For $M=2$, write $I_2(V;X)=I(V;X)/\log 2$ when mutual information has been computed with natural logarithms. The quoted theorem is stated in base-$2$ entropy units, so *Fano's Inequality* gives, for every decoder $\hat V=\hat V(X)$,
\begin{align*}
\mathbb P(\hat V\ne V)\ge 1-\frac{I_2(V;X)+1}{\log_2 2}.
\end{align*}
The right-hand side is
\begin{align*}
1-\frac{I_2(V;X)+1}{1}
=1-I_2(V;X)-1.
\end{align*}
Substituting $I_2(V;X)=I(V;X)/\log 2$, this becomes
\begin{align*}
1-I_2(V;X)-1=-\frac{I(V;X)}{\log 2}.
\end{align*}
By the definition of mutual information as a Kullback-Leibler divergence, $I(V;X)\ge 0$, so
\begin{align*}
-\frac{I(V;X)}{\log 2}\le 0.
\end{align*}
Thus binary Fano only proves $\mathbb P(\hat V\ne V)\ge$ a nonpositive number, while every error probability already satisfies $\mathbb P(\hat V\ne V)\ge 0$.
The binary case therefore needs a two-point testing bound from Chapter 1. Fano becomes useful in the many-hypothesis regime because the additive $\log 2$ term is then small compared with $\log M$; for instance, if $M=e^m$, then
\begin{align*}
\frac{\log 2}{\log M}=\frac{\log 2}{m},
\end{align*}
which tends to $0$ as $m$ grows. The useful information in Fano comes from a packing with large entropy $\log M$, not from applying it to a binary test.
[/example]
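The contrast between the binary and many-hypothesis regimes can be tabulated. The sketch below assumes the natural-logarithm form $1-(I(V;X)+\log 2)/\log M$, which is consistent with the binary display above; the quoted theorem's exact statement is not reproduced here, and the function name is hypothetical.

```python
import math

def fano_error_lower_bound(mutual_info_nats, M):
    """Assumed Fano form 1 - (I + log 2) / log M, in natural logarithms, for M >= 2."""
    return 1.0 - (mutual_info_nats + math.log(2)) / math.log(M)

# Binary case: the bound equals -I / log 2 and is never positive.
binary = fano_error_lower_bound(0.1, 2)

# Many hypotheses: with I equal to half the entropy log M, the bound approaches 1/2.
for exponent in (2, 10, 20, 100):
    M = 2 ** exponent
    print(exponent, fano_error_lower_bound(0.5 * math.log(M), M))
print(binary)
```

With $M=2^{r}$ and $I(V;X)=\tfrac12\log M$ the bound is $\tfrac12-1/r$, so the additive $\log 2$ term matters only for small alphabets.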
## From Testing Error to Estimation Risk
The practical problem is not to estimate a discrete index, but to lower-bound risk over a parameter space. A packing converts one task into the other: if parameters are separated in loss, then a small-loss estimator identifies the packing point by nearest-neighbour decoding.
[definition: Packing Number]
Let $(T,d)$ be a metric space and let $\varepsilon>0$. A subset $\{t_1,\dots,t_M\}\subset T$ is an $\varepsilon$-packing if $d(t_j,t_k)>\varepsilon$ for all $j\ne k$. The packing number is the map
\begin{align*}
M:(0,\infty)\times\{(T,d):(T,d)\text{ is a metric space}\}\to \mathbb N\cup\{\infty\},
\qquad (\varepsilon,T,d)\mapsto M(\varepsilon,T,d),
\end{align*}
where $M(\varepsilon,T,d)$ is the largest cardinality of an $\varepsilon$-packing in $T$.
[/definition]
A large packing supplies many possible truths, while separation gives the loss scale. We now need a theorem that combines this geometric separation with the information bound from the previous section.
[quotetheorem:5900]
[citeproof:5900]
This reduction is the bridge from information theory back to statistical loss. For squared loss, the same decoding argument is applied after taking square roots of the separation scale, and this is the form used in the main examples.
The $2s$ separation cannot be dropped: if two packing points are closer than $2s$, an estimator may lie within distance $s$ of both and nearest-neighbour decoding no longer certifies the true index. The average KL condition is also a genuine restriction, since widely separated parameters can be statistically easy to distinguish and then produce no lower bound. The theorem is therefore a balancing statement, and the next local version makes that balance easier to enforce by keeping all alternatives near a common centre.
[example: Euclidean Packing Gives a Squared-Risk Bound]
Let $\theta_1,\dots,\theta_M$ be packing points with $|\theta_j-\theta_k|>2s$ for $j\ne k$, let $V$ be the true index, drawn uniformly from $\{1,\dots,M\}$, and define $\hat V$ to be the index of a nearest packing point to $\hat\theta$. If $|\hat\theta-\theta_V|<s$, then for every $j\ne V$ the triangle inequality gives
\begin{align*}
|\hat\theta-\theta_j|\ge |\theta_j-\theta_V|-|\hat\theta-\theta_V|>2s-s=s.
\end{align*}
Since $|\hat\theta-\theta_V|<s$, this shows that $\theta_V$ is strictly closer to $\hat\theta$ than every other packing point, so $\hat V=V$. Taking the contrapositive, $\hat V\ne V$ implies $|\hat\theta-\theta_V|\ge s$, and hence
\begin{align*}
|\hat\theta-\theta_V|^2\ge s^2\mathbb 1_{\{\hat V\ne V\}}.
\end{align*}
Averaging over the uniform choice of $V$ gives
\begin{align*}
\frac{1}{M}\sum_{j=1}^M \mathbb E_{\theta_j}[|\hat\theta-\theta_j|^2]=\mathbb E[|\hat\theta-\theta_V|^2]\ge s^2\mathbb P(\hat V\ne V).
\end{align*}
The supremum risk dominates this average risk, so
\begin{align*}
\sup_{\theta}\mathbb E_{\theta}[|\hat\theta-\theta|^2]\ge s^2\mathbb P(\hat V\ne V).
\end{align*}
Combining this inequality with the testing lower bound from *Fano Testing to Estimation Reduction*, under the average KL condition with parameter $\alpha$, yields
\begin{align*}
\sup_{\theta}\mathbb E_{\theta}[|\hat\theta-\theta|^2]\ge s^2\left(1-\alpha-\frac{\log 2}{\log M}\right).
\end{align*}
Thus a Euclidean $2s$-packing converts testing error into a squared-risk lower bound at scale $s^2$, which is why sparse normal means and prediction-error lower bounds track squared separation rather than separation itself.
[/example]
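The decoding step in this example can be made concrete. The following is a minimal sketch with a hypothetical four-point packing in $\mathbb R^2$ and $s=1.4$; the names `nearest_index`, `packing`, and the chosen estimates are illustrative assumptions.

```python
import math

def nearest_index(estimate, packing):
    """Return the index of a packing point closest to the estimate (lowest index on ties)."""
    distances = [math.dist(estimate, point) for point in packing]
    return min(range(len(packing)), key=distances.__getitem__)

# Hypothetical packing: all pairwise distances are at least 3 > 2s with s = 1.4.
s = 1.4
packing = [(0.0, 0.0), (3.0, 0.0), (0.0, 3.0), (3.0, 3.0)]

# An estimate within distance s of theta_1 (index 1) decodes to index 1.
close_estimate = (4.0, 0.9)            # distance sqrt(1.81) < 1.4 from (3, 0)
decoded_close = nearest_index(close_estimate, packing)

# If the true index is 0 and decoding fails, the loss is at least s.
far_estimate = (1.6, 0.0)              # closer to (3, 0) than to (0, 0)
decoded_far = nearest_index(far_estimate, packing)
loss_far = math.dist(far_estimate, packing[0])
print(decoded_close, decoded_far, loss_far >= s)
```

The second case is the contrapositive used in the text: a wrong decoded index certifies that the estimator is at distance at least $s$ from the truth.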
The preceding theorem uses a global finite packing. Many lower bounds become sharper when the packing is placed inside a small KL ball around a base point, so the next principle isolates this local construction.
[quotetheorem:5901]
[citeproof:5901]
The local radius hypothesis is the safeguard against a misleading packing: a parameter space can contain many well-separated points that are also easy to tell apart from the data. Taking a common centre would fail if the KL geometry were highly asymmetric and some alternatives had large divergence from $P_{\theta_0}$. In applications the work is to design a packing whose loss separation grows while its information radius stays below the entropy scale.
## Sparse Packings and the Varshamov-Gilbert Lemma
High-dimensional sparsity creates large combinatorial classes. The main question is how to choose many sparse vectors that are far apart in Hamming or Euclidean distance while keeping their coordinates controlled enough for KL calculations.
[definition: Hamming Distance]
For $d\in\mathbb N$, the Hamming distance is the map
\begin{align*}
d_H:\{0,1\}^d\times\{0,1\}^d\to\{0,\dots,d\}
\end{align*}
defined by
\begin{align*}
d_H(u,v)=|\{j\in\{1,\dots,d\}:u_j\ne v_j\}|.
\end{align*}
[/definition]
Hamming distance is the natural combinatorial separation for support patterns. We need a large family of sparse binary vectors with controlled pairwise Hamming distance, because this family will become the parameter packing in sparse estimation.
[quotetheorem:5902]
[citeproof:5902]
The lemma supplies entropy of the correct order for sparse supports. To use it in a Gaussian model, we embed each binary support vector into $\mathbb R^d$ and translate Hamming distance into Euclidean distance.
The constant-weight condition matters because allowing all vectors of weight at most $k$ would mix support-size effects with separation effects: the all-zero vector has Hamming distance only $1$ from every one-sparse vector, so a naive collection of many low-weight vectors need not be separated at scale $k$. The assumption $k\le d/2$ is the sparse regime in which $\log(d/k)$ is positive and the constant-weight layer has large entropy; at $k=d$, the layer contains only the all-ones vector, so this lemma cannot provide an exponential packing. The lemma also does not choose amplitudes, control KL divergence, or produce a Euclidean lower bound by itself. It supplies the combinatorial skeleton, and the statistical model determines the scale attached to each selected support.
[example: Sparse Binary Vectors Become Euclidean Packings]
Let $\mathcal V\subset\{0,1\}^d$ be a Varshamov-Gilbert set whose vectors have exactly $k$ nonzero coordinates and satisfy $d_H(v,v')\ge k/2$ whenever $v\ne v'$. For a scale $a>0$, define $\theta_v=a v\in\mathbb R^d$. We compute the Euclidean separation of two distinct embedded vectors. Since each coordinate of $v-v'$ belongs to $\{-1,0,1\}$,
\begin{align*}
|\theta_v-\theta_{v'}|^2=|a(v-v')|^2=\sum_{\ell=1}^d a^2(v_\ell-v'_\ell)^2=a^2\sum_{\ell=1}^d (v_\ell-v'_\ell)^2.
\end{align*}
For binary coordinates, $(v_\ell-v'_\ell)^2=1$ exactly when $v_\ell\ne v'_\ell$, and $(v_\ell-v'_\ell)^2=0$ exactly when $v_\ell=v'_\ell$. Therefore
\begin{align*}
\sum_{\ell=1}^d (v_\ell-v'_\ell)^2=|\{\ell:v_\ell\ne v'_\ell\}|=d_H(v,v').
\end{align*}
Combining this identity with the Varshamov-Gilbert separation gives
\begin{align*}
|\theta_v-\theta_{v'}|^2=a^2d_H(v,v')\ge \frac{a^2k}{2}.
\end{align*}
Each $v\in\mathcal V$ has exactly $k$ coordinates equal to $1$ and all remaining coordinates equal to $0$, so $\theta_v=a v$ has the same support and is therefore $k$-sparse. Its squared norm is
\begin{align*}
|\theta_v|^2=\sum_{\ell=1}^d (a v_\ell)^2=a^2\sum_{\ell=1}^d v_\ell^2.
\end{align*}
Because $v_\ell^2=v_\ell$ for $v_\ell\in\{0,1\}$ and $\sum_{\ell=1}^d v_\ell=k$, this becomes
\begin{align*}
|\theta_v|^2=a^2k.
\end{align*}
Thus the binary packing becomes a Euclidean packing with squared separation at least $a^2k/2$, while the single amplitude $a$ controls both the separation and the norm of every packed vector.
[/example]
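For small $d$ and $k$ such a family can be built explicitly. The sketch below uses a greedy search over supports, which is a swapped-in construction and not the counting argument behind the lemma; it gives no guarantee on the size of the family. The parameter values and helper names are assumptions chosen for illustration.

```python
from itertools import combinations

def hamming(u, v):
    """Number of coordinates where the binary vectors u and v differ."""
    return sum(a != b for a, b in zip(u, v))

def sq_dist(x, y):
    """Squared Euclidean distance."""
    return sum((a - b) ** 2 for a, b in zip(x, y))

def greedy_constant_weight_packing(d, k, min_dist):
    """Greedily keep weight-k binary vectors at pairwise Hamming distance >= min_dist."""
    kept = []
    for support in combinations(range(d), k):
        v = tuple(1 if j in support else 0 for j in range(d))
        if all(hamming(v, w) >= min_dist for w in kept):
            kept.append(v)
    return kept

d, k, a = 12, 6, 0.5
packing = greedy_constant_weight_packing(d, k, min_dist=k / 2)
thetas = [[a * x for x in v] for v in packing]   # embedded vectors theta_v = a v

# Every pair satisfies |theta_v - theta_v'|^2 = a^2 d_H(v, v') >= a^2 k / 2.
min_sq_sep = min(sq_dist(x, y) for x, y in combinations(thetas, 2))
print(len(packing), min_sq_sep, a * a * k / 2)
```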
## Minimax Lower Bound for Sparse Gaussian Mean Estimation
We now apply the method to a benchmark high-dimensional model. The statistical question is how accurately one can estimate a sparse mean vector from Gaussian observations when the support is unknown.
[definition: Sparse Gaussian Mean Model]
Observe $X\in\mathbb R^d$ with
\begin{align*}
X\sim \mathcal N(\theta,\sigma^2 I_d),
\end{align*}
where the parameter belongs to
\begin{align*}
\Theta_k=\{\theta\in\mathbb R^d:|\{j:\theta_j\ne 0\}|\le k\}.
\end{align*}
For $R>0$, write
\begin{align*}
\Theta_k(R)=\{\theta\in\Theta_k:|\theta|^2\le R^2\}.
\end{align*}
The loss is the map
\begin{align*}
L:\mathbb R^d\times\mathbb R^d\to[0,\infty),
\qquad L(\hat\theta,\theta)=|\hat\theta-\theta|^2.
\end{align*}
[/definition]
The sparse packing above lives inside a bounded local slice of $\Theta_k$, and the Gaussian KL formula gives the information side. The remaining task is to choose the amplitude $a$ so that the packing fits inside the stated radius while separation is as large as possible and the KL radius stays below the packing entropy.
[quotetheorem:5903]
[citeproof:5903]
The result identifies the price of unknown support selection in a bounded sparse subproblem, and hence gives a valid lower bound for the larger unrestricted class. Each hypothesis has a concrete role. If the radius is smaller than the target scale, for instance $R^2\ll \sigma^2 k\log(d/k)$, the displayed packing does not fit in $\Theta_k(R)$ and no argument can force a lower bound larger than the diameter scale $R^2$. If $k>d/2$, the Varshamov-Gilbert sparse layer no longer supplies the entropy order used in the proof. If $\sigma^2=0$, the observation reveals $\theta$ exactly and the lower bound disappears. The theorem also does not claim a matching upper bound or an all-regime characterization; it proves the sparse-support Fano obstruction at the radius where that obstruction is present.
The Gaussian noise variance appears for an information-theoretic reason: larger $\sigma^2$ makes the same Euclidean separation harder to distinguish in KL divergence. The restriction $k\le d/2$ is inherited from the sparse packing lemma, not from the Gaussian likelihood calculation itself. Near the dense endpoint a separate packing of a Euclidean ball gives the usual $\sigma^2 d$ lower bound, so the sparse construction should be read as the sparse-regime component of the minimax theory.
[remark: Dense and Sparse Regimes]
The displayed lower bound captures the combinatorial cost of support uncertainty. When $k$ is comparable to $d$, the usual dense Gaussian mean lower bound gives order $\sigma^2 d$, while the expression $k\log(d/k)$ no longer has the correct dense behaviour at the endpoint. A full statement usually takes the minimum or maximum of sparse and dense constructions depending on the parameter range.
[/remark]
## Sparse Linear Regression Under Gaussian Design
The same entropy calculation appears in regression, but the loss is prediction error rather than direct Euclidean error. Gaussian design makes the prediction metric comparable to Euclidean distance through the covariance matrix.
[definition: Sparse Linear Regression Model]
Observe independent pairs $(Y_i,Z_i)_{i=1}^{n}$ satisfying
\begin{align*}
Y_i=Z_i\cdot\beta+\varepsilon_i,
\end{align*}
where $Z_i\sim\mathcal N(0,\Sigma)$ in $\mathbb R^d$, $\varepsilon_i\sim\mathcal N(0,\sigma^2)$, the design variables and noises are independent, and $\beta$ belongs to
\begin{align*}
\mathcal B_k=\{\beta\in\mathbb R^d:|\{j:\beta_j\ne 0\}|\le k\}.
\end{align*}
For $R>0$, write
\begin{align*}
\mathcal B_k(R)=\{\beta\in\mathcal B_k:|\beta|^2\le R^2\}.
\end{align*}
The prediction loss is the map
\begin{align*}
L_\Sigma:\mathbb R^d\times\mathbb R^d\to[0,\infty),
\qquad L_\Sigma(\hat\beta,\beta)=|\Sigma^{1/2}(\hat\beta-\beta)|^2.
\end{align*}
[/definition]
The KL divergence in regression is conditional on the design, while the target risk is prediction loss. We therefore need a local lower bound showing that the same sparse packing fits in a bounded coefficient class, remains hard after the random design is averaged out, and introduces the factor $1/n$ through the information calculation.
[quotetheorem:5904]
[citeproof:5904]
The information calculation explains why the regression rate has the same entropy term as sparse normal means but divided by $n$. In the isotropic case the relationship between Euclidean and prediction geometry has no additional constants, making the balance transparent.
The sparse eigenvalue and radius hypotheses have separate roles. If $\Sigma$ has very large variance on the selected sparse coordinates, then the data reveal the coefficient faster and the amplitude must be reduced to keep KL small; this is why $\kappa_+$ appears in the denominator. If $\Sigma$ nearly annihilates a sparse direction, prediction separation also collapses; this is why $\kappa_-$ appears in the numerator. If the coefficient radius satisfies $R^2\ll \sigma^2 k\log(d/k)/(n\kappa_+)$, the Fano packing at the displayed scale does not fit in the parameter class. The theorem does not assert the full global minimax rate in every design and sample-size regime; it proves the sparse-packing lower bound under the stated local radius and sparse-eigenvalue conditions.
[example: Isotropic Design]
When $\Sigma=I_d$, the prediction loss becomes
\begin{align*}
|\Sigma^{1/2}(\hat\beta-\beta)|^2=|I_d(\hat\beta-\beta)|^2=|\hat\beta-\beta|^2.
\end{align*}
Let $\mathcal V\subset\{0,1\}^d$ be a sparse binary packing with $d_H(v,v')\ge k/2$ for distinct $v,v'$, and set $\beta_v=a v$. For $v\ne v'$,
\begin{align*}
|\beta_v-\beta_{v'}|^2=|a(v-v')|^2=\sum_{\ell=1}^d a^2(v_\ell-v'_\ell)^2.
\end{align*}
Since $(v_\ell-v'_\ell)^2=1$ exactly on the coordinates where $v_\ell\ne v'_\ell$ and is $0$ otherwise,
\begin{align*}
\sum_{\ell=1}^d (v_\ell-v'_\ell)^2=d_H(v,v').
\end{align*}
Therefore
\begin{align*}
|\beta_v-\beta_{v'}|^2=a^2d_H(v,v')\ge \frac{a^2k}{2}.
\end{align*}
Thus the squared prediction separation is at least $a^2k/2$.
Under $\beta_v$, conditionally on the design matrix $Z$, the response vector has law $\mathcal N(Z\beta_v,\sigma^2 I_n)$, while under $0$ it has law $\mathcal N(0,\sigma^2 I_n)$. The Gaussian-shift KL formula gives
\begin{align*}
D(P_{\beta_v}(\,\cdot\,|Z)\|P_0(\,\cdot\,|Z))=\frac{|Z\beta_v|^2}{2\sigma^2}.
\end{align*}
Since the rows $Z_i$ are independent $\mathcal N(0,I_d)$,
\begin{align*}
\mathbb E_Z[|Z\beta_v|^2]=\mathbb E_Z\left[\sum_{i=1}^n (Z_i\cdot \beta_v)^2\right].
\end{align*}
Linearity of expectation gives
\begin{align*}
\mathbb E_Z\left[\sum_{i=1}^n (Z_i\cdot \beta_v)^2\right]=\sum_{i=1}^n \mathbb E_Z[(Z_i\cdot \beta_v)^2].
\end{align*}
For $Z_i\sim \mathcal N(0,I_d)$, the scalar $Z_i\cdot\beta_v$ has mean $0$ and variance $|\beta_v|^2$, so
\begin{align*}
\mathbb E_Z[(Z_i\cdot \beta_v)^2]=|\beta_v|^2.
\end{align*}
Hence
\begin{align*}
\mathbb E_Z[|Z\beta_v|^2]=n|\beta_v|^2.
\end{align*}
Because $v$ has exactly $k$ ones,
\begin{align*}
|\beta_v|^2=|av|^2=\sum_{\ell=1}^d a^2v_\ell^2=a^2\sum_{\ell=1}^d v_\ell=a^2k.
\end{align*}
Therefore the averaged KL radius to zero is
\begin{align*}
\mathbb E_Z[D(P_{\beta_v}(\,\cdot\,|Z)\|P_0(\,\cdot\,|Z))]=\frac{na^2k}{2\sigma^2}.
\end{align*}
Choose
\begin{align*}
a^2=c_1\frac{\sigma^2\log(d/k)}{n}
\end{align*}
with $c_1>0$ small enough that the displayed KL radius is a fixed small multiple of the packing entropy, which is of order $k\log(d/k)$. The corresponding squared prediction separation is
\begin{align*}
\frac{a^2k}{2}=\frac{c_1}{2}\sigma^2\frac{k\log(d/k)}{n}.
\end{align*}
Thus in the isotropic design case the Fano packing gives a prediction-risk lower bound of order $\sigma^2 k\log(d/k)/n$.
[/example]
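The bookkeeping in this example is simple enough to script. The sketch below assumes the formulas displayed above, with hypothetical values for $n$, $d$, $k$, $\sigma$, and $c_1$; it treats $k\log(d/k)$ as the entropy scale without tracking the lemma's constants.

```python
import math

def isotropic_fano_scales(n, d, k, sigma, c1):
    """Amplitude, averaged KL radius, and squared separation from the isotropic example."""
    entropy_scale = k * math.log(d / k)            # order of the packing entropy
    a_sq = c1 * sigma ** 2 * math.log(d / k) / n   # a^2 = c1 sigma^2 log(d/k) / n
    kl_radius = n * a_sq * k / (2 * sigma ** 2)    # n a^2 k / (2 sigma^2)
    separation = a_sq * k / 2                      # a^2 k / 2
    return entropy_scale, a_sq, kl_radius, separation

entropy, a_sq, kl, sep = isotropic_fano_scales(n=200, d=1000, k=10, sigma=1.5, c1=0.1)
print(kl / entropy)   # the KL radius is the fraction c1 / 2 of the entropy scale
print(sep)            # equals (c1 / 2) sigma^2 k log(d/k) / n
```

Doubling $n$ halves the separation while leaving the ratio of KL radius to entropy unchanged, which is the sense in which the factor $1/n$ enters through the information calculation.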
## Yang-Barron Entropy Lower Bounds
Finite packings are often enough for sparse parametric classes, but nonparametric classes may be better described by their metric entropy at every scale. The Yang-Barron viewpoint relates minimax risk to the balance between entropy growth and KL neighbourhood size.
[definition: Metric Entropy]
Let $(T,d)$ be a metric space. The metric entropy at scale $\varepsilon>0$ is
\begin{align*}
H:(0,\infty)\times\{(T,d):(T,d)\text{ is a metric space}\}\to [0,\infty],
\qquad (\varepsilon,T,d)\mapsto H(\varepsilon,T,d),
\end{align*}
where $H(\varepsilon,T,d)=\log M(\varepsilon,T,d)$ and $M(\varepsilon,T,d)$ is the packing number.
[/definition]
Entropy records how many distinguishable alternatives exist at a given resolution. We now ask for the best resolution: the useful scale is the one where the entropy of distinguishable alternatives is comparable to the information budget.
[quotetheorem:5905]
[citeproof:5905]
The theorem packages the chapter's method as a scale-selection rule. A schematic smoothness example shows how the familiar nonparametric rates arise from balancing entropy against sample information.
The theorem is only as sharp as the available entropy estimates and local KL construction. If the packing entropy is computed globally but the KL condition holds only near a smoother centre, the scale optimization can overstate the difficulty of the class. For a concrete failure mode, take a parameter space containing many well-separated points whose laws are mutually singular, such as point masses $P_\theta=\delta_\theta$ at distinct observations. The metric entropy can be arbitrarily large, but no finite reference-law average KL bound exists for a reference that misses any packing point, and the data identify $\theta$ exactly. This shows why both finiteness of the packing used at the scale and the KL-admissibility condition are part of the theorem rather than technical decoration. Its strength is that the same proof template links sparse combinatorics, function-space entropy, and later random-matrix examples through a single testing reduction.
[example: Heuristic Sobolev Entropy Balance]
For a Sobolev-type smoothness class on a $d_0$-dimensional domain, let $s>0$ be the smoothness index, and suppose that at $L^2$ scale $\varepsilon$ the packing entropy has order $\varepsilon^{-d_0/s}$. If a local packing at that scale has average KL bounded by a constant multiple of $n\varepsilon^2$, then the scale selected by *Yang-Barron Entropy Lower Bound* is the one where the information budget and entropy are of the same order:
\begin{align*}
n\varepsilon^2 \asymp \varepsilon^{-d_0/s}.
\end{align*}
Since $\varepsilon>0$, multiply both sides by $\varepsilon^{d_0/s}$ to get
\begin{align*}
n\varepsilon^{2+d_0/s}\asymp 1.
\end{align*}
Dividing by $n$ gives
\begin{align*}
\varepsilon^{2+d_0/s}\asymp n^{-1}.
\end{align*}
Because
\begin{align*}
2+\frac{d_0}{s}=\frac{2s+d_0}{s},
\end{align*}
this is
\begin{align*}
\varepsilon^{(2s+d_0)/s}\asymp n^{-1}.
\end{align*}
Raise both sides to the power $s/(2s+d_0)$:
\begin{align*}
\varepsilon\asymp n^{-s/(2s+d_0)}.
\end{align*}
Squaring both sides yields
\begin{align*}
\varepsilon^2\asymp n^{-2s/(2s+d_0)}.
\end{align*}
Thus the entropy-information balance predicts squared $L^2$ risk at scale $n^{-2s/(2s+d_0)}$: a larger smoothness index $s$ increases the exponent, while a larger ambient dimension $d_0$ decreases it.
[/example]
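The balance equation can also be solved numerically and compared with the closed form. The sketch below assumes that $\asymp$ is replaced by equality with constant $1$, which is a simplification for illustration only; the function name and parameter values are hypothetical.

```python
import math

def balance_scale(n, s, d0, iters=200):
    """Solve n * eps**2 == eps**(-d0 / s) for eps in (1e-12, 1] by bisection on a log scale."""
    def log_gap(eps):
        # log(n eps^2) - log(eps^(-d0/s)), which is increasing in eps
        return math.log(n) + (2.0 + d0 / s) * math.log(eps)
    lo, hi = 1e-12, 1.0   # assumes the root lies in this bracket (moderate n >= 1)
    for _ in range(iters):
        mid = math.sqrt(lo * hi)
        if log_gap(mid) < 0:
            lo = mid
        else:
            hi = mid
    return math.sqrt(lo * hi)

n, s, d0 = 10 ** 6, 2.0, 1.0
eps_numeric = balance_scale(n, s, d0)
eps_closed = n ** (-s / (2 * s + d0))
print(eps_numeric, eps_closed, eps_closed ** 2)
```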
The chapter's method can now be summarized as a recipe. Build a separated packing, compute or bound the KL divergence to a centre or mixture, choose the scale so that information is below entropy, and apply Fano to convert testing hardness into estimation risk. Later chapters reuse this recipe with sharper random-matrix tools and with spectral parameter spaces where the packing geometry is more subtle.
The Fano argument has shown how packings and testing hardness convert into minimax lower bounds. The next chapter refines this perspective by exploiting coordinatewise structure, where hypercube constructions and pairwise comparisons often give sharper results than global packings.
# 3. Assouad's Lemma and Coordinatewise Hardness
This chapter replaces global packing arguments by coordinatewise reductions. It assumes the preceding material on minimax risk and binary testing from Chapter 1, Fano-type lower bounds from Chapter 2, and the basic probability distances between statistical experiments introduced in Chapters 0 and 2. Fano's inequality treats a large finite family as a single multiple-testing problem, while Assouad's lemma exploits a hypercube of parameters and asks how many coordinates of the hidden vertex must be misclassified. This is especially effective when the statistical experiment factors across coordinates or nearly factors across local perturbations.
The guiding question is: if a parameter is indexed by $v \in \{0,1\}^m$, how much risk is forced by the difficulty of deciding each bit of $v$ from the data? The answer links minimax estimation to Hamming geometry, total variation distance, Hellinger affinity, and KL bounds for neighboring experiments.
## Hypercube Reductions and Hamming Geometry
A global packing lower bound can lose information when two parameters differ in many coordinates but the loss is additive across coordinates. The hypercube method keeps the coordinate structure visible: each vertex represents a parameter, and each edge represents the smallest statistical comparison needed to recover one coordinate.
[definition: Hypercube Parameterization]
Let $m \in \mathbb N$. A hypercube parameterization of a parameter space $\Theta$ is a map
\begin{align*}
\psi : \{0,1\}^m \to \Theta.
\end{align*}
For $v \in \{0,1\}^m$, write $P_v$ for the distribution of the observation under parameter $\psi(v)$.
[/definition]
The map $\psi$ restricts the original model to the finite experiment $\{P_v : v \in \{0,1\}^m\}$. To measure how many coordinate decisions separate two vertices, we now put the usual graph metric on the cube.
[definition: Hamming Distance]
The Hamming distance is the map
\begin{align*}
d_H : \{0,1\}^m \times \{0,1\}^m \to \{0,1,\dots,m\}.
\end{align*}
It is defined by
\begin{align*}
d_H(u,v) := \sum_{j=1}^m \mathbb{1}_{\{u_j \neq v_j\}}.
\end{align*}
[/definition]
Hamming distance is the natural loss for recovering the hidden vertex. To transfer this into an estimation lower bound, we require a separation condition saying that changing many coordinates changes the statistical target by a proportional amount.
[definition: Assouad Separation]
Let $(\Theta,\rho)$ be a metric space. A hypercube parameterization $\psi : \{0,1\}^m \to \Theta$ has Assouad separation $s>0$ if
\begin{align*}
\rho(\psi(u),\psi(v)) \geq s\, d_H(u,v)
\end{align*}
for all $u,v \in \{0,1\}^m$.
[/definition]
For squared Euclidean loss, it is often more convenient to formulate separation after taking square roots, or to use an additive squared separation directly. The sparse normal means cube gives the model example because every bit contributes the same amount of squared loss.
[example: Sparse Sign Hypercube]
Fix $m \le d$ and amplitude $a>0$. In the Gaussian sequence model $Y \sim \mathcal N(\theta,\sigma^2 I_d)$, define
\begin{align*}
\psi(v)_j = a(2v_j-1) \text{ for } 1 \le j \le m, \qquad \psi(v)_j = 0 \text{ for } m<j\le d.
\end{align*}
For $1\le j\le m$, the coordinate $\psi(v)_j$ is either $-a$ or $a$, and for $j>m$ it is $0$, so every $\psi(v)$ has at most $m$ nonzero coordinates.
For $u,v\in\{0,1\}^m$, first compute one active coordinate:
\begin{align*}
\psi(u)_j-\psi(v)_j = a(2u_j-1)-a(2v_j-1)=2a(u_j-v_j) \quad (1\le j\le m).
\end{align*}
For the inactive coordinates,
\begin{align*}
\psi(u)_j-\psi(v)_j=0 \quad (m<j\le d).
\end{align*}
Therefore the squared Euclidean distance is
\begin{align*}
|\psi(u)-\psi(v)|^2=\sum_{j=1}^d(\psi(u)_j-\psi(v)_j)^2.
\end{align*}
Substituting the two coordinate formulas gives
\begin{align*}
|\psi(u)-\psi(v)|^2=\sum_{j=1}^m \{2a(u_j-v_j)\}^2+\sum_{j=m+1}^d 0^2.
\end{align*}
Thus
\begin{align*}
|\psi(u)-\psi(v)|^2=4a^2\sum_{j=1}^m (u_j-v_j)^2.
\end{align*}
Since $u_j,v_j\in\{0,1\}$, the quantity $(u_j-v_j)^2$ equals $1$ when $u_j\neq v_j$ and equals $0$ when $u_j=v_j$. Hence
\begin{align*}
\sum_{j=1}^m (u_j-v_j)^2=\sum_{j=1}^m \mathbb{1}_{\{u_j\neq v_j\}}=d_H(u,v).
\end{align*}
Combining the last two displays,
\begin{align*}
|\psi(u)-\psi(v)|^2=4a^2 d_H(u,v).
\end{align*}
Thus squared error over this cube is exactly $4a^2$ times Hamming error in the sign vector. The model also factors across coordinates: under vertex $v$, the coordinates $Y_j$ are independent, with $Y_j\sim \mathcal N(a(2v_j-1),\sigma^2)$ for $1\le j\le m$ and $Y_j\sim \mathcal N(0,\sigma^2)$ for $j>m$.
[/example]
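The identity $|\psi(u)-\psi(v)|^2=4a^2d_H(u,v)$ can be verified exhaustively on a small cube. This is a minimal sketch with hypothetical values $m=3$, $d=5$, $a=0.5$; the helper names are illustrative.

```python
from itertools import product

def psi(v, d, a):
    """Sign embedding: a(2 v_j - 1) on the first m coordinates, 0 on the remaining d - m."""
    return [a * (2 * bit - 1) for bit in v] + [0.0] * (d - len(v))

def hamming(u, v):
    """Number of coordinates where u and v differ."""
    return sum(x != y for x, y in zip(u, v))

def sq_dist(x, y):
    """Squared Euclidean distance."""
    return sum((p - q) ** 2 for p, q in zip(x, y))

m, d, a = 3, 5, 0.5
vertices = list(product((0, 1), repeat=m))
max_gap = max(
    abs(sq_dist(psi(u, d, a), psi(v, d, a)) - 4 * a * a * hamming(u, v))
    for u in vertices
    for v in vertices
)
print(max_gap)  # largest deviation from 4 a^2 d_H over all vertex pairs
```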
The example shows that a lower bound can be assembled from repeated one-coordinate testing problems, but a global estimator does not directly report the coordinate bits of the cube. The missing step is a decoder that converts the estimator's output into a vertex label, while preserving enough coordinate error to charge loss back to the original estimation problem. The cube geometry supplies exactly the uniform edge separation needed for that conversion.
[quotetheorem:5906]
[citeproof:5906]
The metric hypothesis and the factor $2s$ in the separation condition are not cosmetic: they are what make the nearest-neighbour decoding step valid. Without a triangle inequality, an estimator could be close to the true parameter in the loss while its nearest cube vertex is far away in Hamming distance; for instance, squared Euclidean distance can have $|0-1|^2=1$ and $|1-2|^2=1$ but $|0-2|^2=4$, so the triangle step used above fails. The uniform separation condition is also needed: if two adjacent cube vertices coincide in $\Theta$, then the corresponding bit cannot contribute a positive loss lower bound even if other coordinates are separated. The bound is useful when the edge experiments remain hard even though the full family contains $2^m$ hypotheses. To use it in concrete models, we need systematic upper bounds on the total variation distances appearing on the cube edges.
## Distances Between Neighboring Experiments
Assouad's lemma asks for upper bounds on total variation distance between neighboring vertices. Direct total variation computations are rare in high-dimensional models, so we use Hellinger distance and KL divergence as more tractable intermediates.
[definition: Total Variation Distance]
Let $\mathcal P(\mathcal X,\mathcal A)$ denote the set of probability measures on a measurable space $(\mathcal X,\mathcal A)$. The total variation distance is the map
\begin{align*}
\operatorname{TV} : \mathcal P(\mathcal X,\mathcal A) \times \mathcal P(\mathcal X,\mathcal A) \to [0,1].
\end{align*}
It is defined by
\begin{align*}
\operatorname{TV}(P,Q) := \sup_{A \in \mathcal A} |P(A)-Q(A)|.
\end{align*}
[/definition]
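On a finite sample space the supremum in the definition can be evaluated by brute force and compared with the half-$\ell^1$ formula $\operatorname{TV}(P,Q)=\tfrac12\sum_x |p(x)-q(x)|$. The following Python sketch is a sanity check under that finite-space assumption; the function names and the two probability vectors are invented for illustration.

```python
from itertools import combinations

def tv_by_events(p, q):
    """Total variation via the definition: sup over all events A of |P(A) - Q(A)|."""
    points = range(len(p))
    best = 0.0
    for size in range(len(p) + 1):
        for event in combinations(points, size):
            gap = abs(sum(p[i] for i in event) - sum(q[i] for i in event))
            best = max(best, gap)
    return best

def tv_half_l1(p, q):
    """Total variation via the identity TV = (1/2) * sum_x |p(x) - q(x)|."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))

# Two illustrative distributions on a four-point space.
p = [0.5, 0.3, 0.2, 0.0]
q = [0.25, 0.25, 0.25, 0.25]
print(tv_by_events(p, q), tv_half_l1(p, q))
```

The brute-force version scans all $2^{|\mathcal X|}$ events, which is why the half-$\ell^1$ form is the one used in practice on discrete spaces.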
Total variation has an operational meaning: it determines the optimal error probability in binary testing with equal priors. Since Assouad reduces estimation to many binary tests, the exact testing identity below is the link between probability metrics and minimax risk.
[quotetheorem:5907]
[citeproof:5907]
The equal-prior hypothesis is essential for this exact formula; unequal priors lead to a weighted total variation expression instead. For example, if $P=Q$ but the prior probabilities are $0.9$ and $0.1$, always guessing the first hypothesis has error $0.1$, not the equal-prior value displayed in the theorem. The theorem also explains why Assouad averages edge errors rather than only bounding likelihood ratios: the statistical obstruction is the unavoidable overlap of neighboring distributions. The preceding testing identity explains why total variation is the right metric, but total variation is not the quantity most likelihood calculations produce. The following definition introduces KL divergence because Gaussian shifts, product likelihoods, and many exponential-family models give KL bounds with short computations.
[definition: Kullback-Leibler Divergence]
Let $\mathcal P(\mathcal X,\mathcal A)$ denote the set of probability measures on a measurable space $(\mathcal X,\mathcal A)$. The Kullback-Leibler divergence is the map
\begin{align*}
D_{\mathrm{KL}} : \mathcal P(\mathcal X,\mathcal A) \times \mathcal P(\mathcal X,\mathcal A) \to [0,\infty].
\end{align*}
It is defined by
\begin{align*}
D_{\mathrm{KL}}(P\|Q) := \int \log\left(\frac{dP}{dQ}\right)\,dP
\end{align*}
when $P \ll Q$, and by $D_{\mathrm{KL}}(P\|Q):=\infty$ when $P \not\ll Q$.
[/definition]
The preceding definition gives a computable information measure, but Assouad still needs total variation on each edge. This creates a mismatch: KL divergence is usually easy to add and compute in product models, whereas total variation is the testing distance that appears in the lemma. A comparison inequality is needed to turn an available KL calculation into the total-variation control required on neighboring experiments.
[quotetheorem:5890]
[citeproof:5890]
Pinsker's inequality is asymmetric on the right-hand side because KL divergence is asymmetric, while total variation is symmetric. The finiteness condition matters: if $P$ charges a point to which $Q$ assigns mass zero, then $D_{\mathrm{KL}}(P\|Q)=\infty$ and the inequality gives no useful edge control, even though total variation is still at most $1$. The inequality is often sharp enough for lower bounds when each edge KL is bounded by a small constant, but it loses constants and can discard the multiplicative structure present in independent product experiments. The following definition introduces Hellinger affinity, a quantity designed for tensorization across independent coordinates.
[definition: Hellinger Affinity And Distance]
Let $\mathcal P(\mathcal X,\mathcal A)$ denote the set of probability measures on a measurable space $(\mathcal X,\mathcal A)$. For probability measures $P$ and $Q$ dominated by $\mu$, with densities $p$ and $q$, the Hellinger affinity is the map
\begin{align*}
\rho_H : \mathcal P(\mathcal X,\mathcal A) \times \mathcal P(\mathcal X,\mathcal A) \to [0,1].
\end{align*}
It is defined by
\begin{align*}
\rho_H(P,Q) := \int \sqrt{pq}\,d\mu.
\end{align*}
The squared Hellinger distance is the map
\begin{align*}
H^2 : \mathcal P(\mathcal X,\mathcal A) \times \mathcal P(\mathcal X,\mathcal A) \to [0,2].
\end{align*}
It is defined by
\begin{align*}
H^2(P,Q) := 2-2\rho_H(P,Q).
\end{align*}
[/definition]
The preceding definition is independent of the chosen dominating measure, so it is a genuine distance-type construction on probability laws. Its value here is that independent coordinates should contribute to distinguishability in a structured way. To use Hellinger affinity on cube or product experiments, we need a rule that expresses the joint affinity through the one-coordinate affinities rather than recomputing an integral in the full product space.
[quotetheorem:5908]
[citeproof:5908]
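For finite product laws the tensorization rule can be verified numerically. The sketch below assumes the product identity $\rho_H(\otimes_j P_j,\otimes_j Q_j)=\prod_j \rho_H(P_j,Q_j)$ and uses the normalization $H^2=2-2\rho_H$ from the definition above; the marginal distributions and the helper names are hypothetical.

```python
import math
from itertools import product

def affinity(p, q):
    """Hellinger affinity sum_x sqrt(p(x) q(x)) for distributions on a finite set."""
    return sum(math.sqrt(pi * qi) for pi, qi in zip(p, q))

def product_law(marginals):
    """Joint pmf of independent coordinates, listed in itertools.product order."""
    supports = [range(len(m)) for m in marginals]
    return [math.prod(m[i] for m, i in zip(marginals, idx)) for idx in product(*supports)]

def squared_hellinger(p, q):
    """Squared Hellinger distance with the normalization H^2 = 2 - 2 * affinity."""
    return 2.0 - 2.0 * affinity(p, q)

# Illustrative marginals for three independent coordinates.
p_marginals = [[0.6, 0.4], [0.2, 0.5, 0.3], [0.9, 0.1]]
q_marginals = [[0.5, 0.5], [0.3, 0.3, 0.4], [0.7, 0.3]]

joint = affinity(product_law(p_marginals), product_law(q_marginals))
coordinatewise = math.prod(affinity(p, q) for p, q in zip(p_marginals, q_marginals))
print(joint, coordinatewise)
```

The joint computation sums over the full product space, while the coordinatewise one needs only the three one-dimensional affinities; the two printed numbers should agree up to rounding.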
The product assumption is the essential hypothesis: with dependent coordinates there is no reason for the joint affinity to split into one-dimensional terms. For example, two coordinates that are perfectly coupled can have a joint law supported on the diagonal, so the joint affinity is controlled by the diagonal mass rather than by a product of marginal affinities. This is why Assouad constructions often try to arrange independent coordinates or condition on all but one coordinate. Tensorization gives lower bounds on affinity, whereas Assouad is stated in terms of total variation. The next comparison converts a large affinity into a small enough total variation distance for the lemma.
[quotetheorem:5909]
[citeproof:5909]
This inequality is weaker than the exact testing identity because it replaces the exact overlap by a convenient lower bound in terms of affinity. Its limitation is visible when constants matter: for identical laws $P=Q$, the left-hand side equals $1$ while the bound only gives the theorem's smaller universal constant. The hypothesis that $P$ and $Q$ are probability measures is also used through $0\le \rho_H(P,Q)\le 1$; without normalization, the elementary comparison would not have the same scale. Its advantage is that tensorization can make $\rho_H(P,Q)$ explicit for product experiments. It turns multiplicative affinity lower bounds into additive minimax lower bounds.
## Coordinatewise Hardness In Sparse Models
Sparse models create a particular obstacle: the estimator must identify many small coordinate-level signals, and a global distance between two alternatives can hide which coordinates caused the error. We now compare Assouad with Fano in the sparse settings that motivate the chapter. Fano uses a large packing and controls the average KL divergence from one reference point. Assouad uses a cube and only requires neighboring vertices to be difficult to distinguish.
[example: Sparse Vector Estimation In Squared Error]
Consider $Y \sim \mathcal N(\theta,\sigma^2 I_d)$ and
\begin{align*}
\Theta_m(a) := \{\theta \in \mathbb R^d : \theta_j \in \{-a,a\}\text{ for }1\le j\le m,\ \theta_j=0\text{ for }j>m\}.
\end{align*}
Index this class by $v\in\{0,1\}^m$ through $\theta(v)_j=a(2v_j-1)$ for $1\le j\le m$ and $\theta(v)_j=0$ for $j>m$. If $v^{(j)}$ is obtained from $v$ by flipping the $j$th bit, then the only changed coordinate is $j$, and
\begin{align*}
\theta(v)_j-\theta(v^{(j)})_j=a(2v_j-1)-a(2(1-v_j)-1).
\end{align*}
Since $2(1-v_j)-1=1-2v_j$, this becomes
\begin{align*}
\theta(v)_j-\theta(v^{(j)})_j=a(2v_j-1)-a(1-2v_j)=2a(2v_j-1).
\end{align*}
Thus this coordinate difference has square $4a^2$, and all other coordinate differences are $0$, so
\begin{align*}
|\theta(v)-\theta(v^{(j)})|^2=4a^2.
\end{align*}
For two Gaussian laws with common covariance $\sigma^2 I_d$ and means $\mu,\nu$, their densities satisfy
\begin{align*}
\log\frac{p_\mu(Y)}{p_\nu(Y)}=-\frac{|Y-\mu|^2}{2\sigma^2}+\frac{|Y-\nu|^2}{2\sigma^2}.
\end{align*}
Equivalently,
\begin{align*}
\log\frac{p_\mu(Y)}{p_\nu(Y)}=\frac{|Y-\nu|^2-|Y-\mu|^2}{2\sigma^2}.
\end{align*}
Using $Y-\nu=(Y-\mu)+(\mu-\nu)$ gives
\begin{align*}
|Y-\nu|^2=|Y-\mu|^2+2(Y-\mu)\cdot(\mu-\nu)+|\mu-\nu|^2.
\end{align*}
Therefore
\begin{align*}
\log\frac{p_\mu(Y)}{p_\nu(Y)}=\frac{2(Y-\mu)\cdot(\mu-\nu)+|\mu-\nu|^2}{2\sigma^2}.
\end{align*}
Taking expectations under $Y\sim\mathcal N(\mu,\sigma^2 I_d)$, the cross term vanishes because $\mathbb E_\mu[Y-\mu]=0$, hence
\begin{align*}
D_{\mathrm{KL}}\!\left(\mathcal N(\mu,\sigma^2 I_d)\middle\|\mathcal N(\nu,\sigma^2 I_d)\right)=\frac{|\mu-\nu|^2}{2\sigma^2}.
\end{align*}
With $\mu=\theta(v)$ and $\nu=\theta(v^{(j)})$,
\begin{align*}
D_{\mathrm{KL}}(P_v\|P_{v^{(j)}})=\frac{4a^2}{2\sigma^2}=\frac{2a^2}{\sigma^2}.
\end{align*}
By *Pinsker Inequality*,
\begin{align*}
\operatorname{TV}(P_v,P_{v^{(j)}})^2 \le \frac{1}{2}D_{\mathrm{KL}}(P_v\|P_{v^{(j)}})=\frac{a^2}{\sigma^2}.
\end{align*}
Thus
\begin{align*}
\operatorname{TV}(P_v,P_{v^{(j)}})\le \frac{a}{\sigma}.
\end{align*}
Squared Euclidean loss is not a metric, so we do not apply Assouad's lemma directly with $\rho(\theta,\theta')=|\theta-\theta'|^2$. Instead, given any estimator $\hat\theta$, define the sign decoder
\begin{align*}
\hat v_j := \mathbb 1_{\{\hat\theta_j\ge 0\}}, \qquad 1\le j\le m.
\end{align*}
If $v_j=1$ and $\hat v_j=0$, then $\theta(v)_j=a$ and $\hat\theta_j<0$, so
\begin{align*}
(\hat\theta_j-\theta(v)_j)^2=(\hat\theta_j-a)^2>a^2.
\end{align*}
If $v_j=0$ and $\hat v_j=1$, then $\theta(v)_j=-a$ and $\hat\theta_j\ge 0$, so
\begin{align*}
(\hat\theta_j-\theta(v)_j)^2=(\hat\theta_j+a)^2\ge a^2.
\end{align*}
Hence every sign error contributes at least $a^2$ squared error in its coordinate, and for every $v$,
\begin{align*}
|\hat\theta-\theta(v)|^2 \ge a^2\sum_{j=1}^m \mathbb 1_{\{\hat v_j\neq v_j\}}=a^2 d_H(\hat v,v).
\end{align*}
Put the uniform prior on $v\in\{0,1\}^m$. For each coordinate $j$, after conditioning on all bits except $v_j$, estimating $v_j$ is a binary test between $P_v$ and $P_{v^{(j)}}$. By *[Binary Hypothesis Testing Characterization of Total Variation](/theorems/5907)*, its average error is at least
\begin{align*}
\frac{1-\operatorname{TV}(P_v,P_{v^{(j)}})}{2}\ge \frac{1-a/\sigma}{2}.
\end{align*}
Summing over the $m$ coordinates gives the Bayes lower bound
\begin{align*}
2^{-m}\sum_{v\in\{0,1\}^m}\mathbb E_v[d_H(\hat v,v)]\ge \frac{m}{2}\left(1-\frac{a}{\sigma}\right).
\end{align*}
Since a supremum is at least the corresponding uniform average, combining this with the decoder inequality $|\hat\theta-\theta(v)|^2\ge a^2 d_H(\hat v,v)$ gives
\begin{align*}
\sup_{v\in\{0,1\}^m}\mathbb E_v[|\hat\theta-\theta(v)|^2]\ge a^2\cdot \frac{m}{2}\left(1-\frac{a}{\sigma}\right).
\end{align*}
Taking the infimum over $\hat\theta$ yields
\begin{align*}
\inf_{\hat\theta}\sup_{\theta\in \Theta_m(a)}\mathbb E_\theta[|\hat\theta-\theta|^2] \ge \frac{ma^2}{2}\left(1-\frac{a}{\sigma}\right),
\end{align*}
whenever $a\le \sigma$. If $a=c\sigma$ for a fixed $0<c<1$, the lower bound is
\begin{align*}
\frac{mc^2\sigma^2}{2}(1-c),
\end{align*}
so the minimax squared-error risk over this cube is at least a constant multiple of $m\sigma^2$.
[/example]
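The quantities in this example are easy to evaluate numerically. The Python sketch below is a hedged illustration: it uses the standard closed form $2\Phi(a/\sigma)-1$ for the total variation between $\mathcal N(-a,\sigma^2)$ and $\mathcal N(a,\sigma^2)$ to show how much Pinsker gives away on one edge, and then scans $c=a/\sigma$ on a grid to locate the amplitude that maximizes the bound $\tfrac{ma^2}{2}(1-a/\sigma)$. The function names, the grid, and the values $m=100$, $\sigma=1$ are arbitrary choices for illustration.

```python
import math

def exact_edge_tv(a, sigma):
    """Exact TV between N(-a, sigma^2) and N(a, sigma^2), i.e. 2*Phi(a/sigma) - 1."""
    return math.erf(a / (sigma * math.sqrt(2.0)))

def pinsker_edge_bound(a, sigma):
    """Pinsker bound sqrt(KL / 2) with KL = 2 a^2 / sigma^2, which equals a / sigma."""
    return math.sqrt(0.5 * 2.0 * a ** 2 / sigma ** 2)

def cube_lower_bound(m, a, sigma):
    """The bound (m a^2 / 2) (1 - a / sigma) from the example, clipped at zero."""
    return max(0.0, 0.5 * m * a ** 2 * (1.0 - a / sigma))

m, sigma = 100, 1.0
grid = [i / 1000 for i in range(1, 1000)]  # candidate values of c = a / sigma
best_c = max(grid, key=lambda c: cube_lower_bound(m, c * sigma, sigma))
print(best_c, cube_lower_bound(m, best_c * sigma, sigma))
print(exact_edge_tv(0.5, sigma), pinsker_edge_bound(0.5, sigma))
```

Calculus confirms the grid search: $c^2(1-c)$ is maximized at $c=2/3$, where the bound equals $2m\sigma^2/27$.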
The example illustrates a coordinatewise phenomenon: each active sign contributes a constant amount of risk if its signal-to-noise ratio is bounded. A packing proof can also recover this rate, but Assouad identifies the per-coordinate source of the lower bound directly.
[remark: Relation To Fano]
Fano's inequality is usually strongest when the goal is to distinguish among many well-separated global alternatives and the average information is small compared with the logarithm of the packing size. Assouad's lemma is usually strongest when the loss decomposes over coordinates and neighboring alternatives are close. In product spaces, the [tensorization of Hellinger affinity](/theorems/5908) allows the edge difficulty to remain visible after adding many irrelevant or independent coordinates.
[/remark]
The contrast with Fano becomes sharper for variable selection. We now formalize support recovery as a Hamming problem, so that each false inclusion or false exclusion counts as one coordinate error.
[definition: Support Recovery Loss]
The support map is
\begin{align*}
S : \mathbb R^d \to 2^{\{1,\dots,d\}}.
\end{align*}
It is defined by
\begin{align*}
S(\theta) := \{j \in \{1,\dots,d\}: \theta_j \neq 0\}.
\end{align*}
The Hamming support loss is the map
\begin{align*}
L : 2^{\{1,\dots,d\}} \times 2^{\{1,\dots,d\}} \to \{0,1,\dots,d\}.
\end{align*}
It is defined by
\begin{align*}
L(\hat S,S) := |\hat S \triangle S|.
\end{align*}
[/definition]
Exact support recovery asks for this loss to be zero with high probability. Lower bounds therefore show that below a signal threshold, at least one coordinate is likely to be misclassified.
[example: Exact Support Recovery In Sparse Normal Means]
Let $Y_j=\theta_j+\sigma Z_j$ independently for $1\le j\le d$, where $Z_j\sim \mathcal N(0,1)$. Index the cube by $v\in\{0,1\}^m$ through
\begin{align*}
\theta(v)_j=av_j \quad (1\le j\le m), \qquad \theta(v)_j=0 \quad (m<j\le d).
\end{align*}
Then $S(\theta(v))=\{j\le m:v_j=1\}$. If $v^{(j)}$ is obtained from $v$ by flipping the $j$th bit, then $\theta(v)$ and $\theta(v^{(j)})$ differ only in coordinate $j$. In that coordinate,
\begin{align*}
\theta(v)_j-\theta(v^{(j)})_j=a v_j-a(1-v_j).
\end{align*}
Since $a v_j-a(1-v_j)=a(2v_j-1)$, we have
\begin{align*}
\theta(v)_j-\theta(v^{(j)})_j=a(2v_j-1).
\end{align*}
Because $v_j\in\{0,1\}$, the number $2v_j-1$ is either $-1$ or $1$, so
\begin{align*}
|\theta(v)-\theta(v^{(j)})|^2=(a(2v_j-1))^2+\sum_{\ell\neq j}0^2=a^2.
\end{align*}
For two Gaussian laws with common covariance $\sigma^2 I_d$ and means $\mu,\nu$, their log-likelihood ratio satisfies
\begin{align*}
\log\frac{p_\mu(Y)}{p_\nu(Y)}=-\frac{|Y-\mu|^2}{2\sigma^2}+\frac{|Y-\nu|^2}{2\sigma^2}.
\end{align*}
Equivalently,
\begin{align*}
\log\frac{p_\mu(Y)}{p_\nu(Y)}=\frac{|Y-\nu|^2-|Y-\mu|^2}{2\sigma^2}.
\end{align*}
Using $Y-\nu=(Y-\mu)+(\mu-\nu)$ gives
\begin{align*}
|Y-\nu|^2=|Y-\mu|^2+2(Y-\mu)\cdot(\mu-\nu)+|\mu-\nu|^2.
\end{align*}
Substituting this expansion into the log-likelihood ratio gives
\begin{align*}
\log\frac{p_\mu(Y)}{p_\nu(Y)}=\frac{2(Y-\mu)\cdot(\mu-\nu)+|\mu-\nu|^2}{2\sigma^2}.
\end{align*}
Taking expectations under $Y\sim\mathcal N(\mu,\sigma^2 I_d)$, the cross term vanishes because $\mathbb E_\mu[Y-\mu]=0$, hence
\begin{align*}
D_{\mathrm{KL}}\!\left(\mathcal N(\mu,\sigma^2 I_d)\middle\|\mathcal N(\nu,\sigma^2 I_d)\right)=\frac{|\mu-\nu|^2}{2\sigma^2}.
\end{align*}
With $\mu=\theta(v)$ and $\nu=\theta(v^{(j)})$, this becomes
\begin{align*}
D_{\mathrm{KL}}(P_v\|P_{v^{(j)}})=\frac{a^2}{2\sigma^2}.
\end{align*}
By *Pinsker Inequality*,
\begin{align*}
\operatorname{TV}(P_v,P_{v^{(j)}})^2 \le \frac{1}{2}D_{\mathrm{KL}}(P_v\|P_{v^{(j)}})=\frac{a^2}{4\sigma^2}.
\end{align*}
Therefore
\begin{align*}
\operatorname{TV}(P_v,P_{v^{(j)}})\le \frac{a}{2\sigma}.
\end{align*}
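Applying *Assouad's Lemma* to this support hypercube, with the preceding binary testing bound giving the coordinate error parameter $\alpha=a/(2\sigma)$, yields the following bound whenever $a\le 2\sigma$, where $\Theta_m(a)$ now denotes the support cube $\{\theta(v):v\in\{0,1\}^m\}$ rather than the sign cube of the previous example: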
\begin{align*}
\inf_{\hat S}\sup_{\theta\in\Theta_m(a)}\mathbb E_\theta[|\hat S\triangle S(\theta)|]\ge \frac{m}{2}\left(1-\frac{a}{2\sigma}\right).
\end{align*}
If $a=c\sigma$ for a fixed $0<c<2$, then this lower bound is
\begin{align*}
\frac{m}{2}\left(1-\frac{c}{2}\right),
\end{align*}
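which is a constant multiple of $m$. Every $S(\theta)$ in this cube is contained in $\{1,\dots,m\}$, and replacing $\hat S$ by $\hat S\cap\{1,\dots,m\}$ neither increases the Hamming loss nor turns a correct recovery into an error, so we may assume $\hat S\subseteq\{1,\dots,m\}$. Then $|\hat S\triangle S(\theta)|\le m\mathbb 1_{\{\hat S\neq S(\theta)\}}$, and the same bound implies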
\begin{align*}
\sup_{\theta\in\Theta_m(a)}\mathbb P_\theta(\hat S\neq S(\theta))\ge \frac{1}{2}\left(1-\frac{c}{2}\right).
\end{align*}
Thus, when the signal amplitude is only a fixed sufficiently small multiple of $\sigma$, the expected number of support errors is order $m$, and exact support recovery cannot have success probability tending to $1$.
[/example]
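To see how conservative the Pinsker step is in this example, one can compare it with the exact equal-prior error of a single bit. The Python sketch below relies on the standard fact that testing $\mathcal N(0,\sigma^2)$ against $\mathcal N(a,\sigma^2)$ with equal priors has Bayes error $\Phi(-a/(2\sigma))$, which equals $(1-\operatorname{TV})/2$ for this pair; the function names and the values of $m$, $\sigma$, and $c$ are illustrative assumptions.

```python
import math

def std_normal_cdf(x):
    """Standard normal CDF computed from the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def exact_bit_error(a, sigma):
    """Equal-prior Bayes error for N(0, sigma^2) versus N(a, sigma^2): Phi(-a / (2 sigma))."""
    return std_normal_cdf(-a / (2.0 * sigma))

def pinsker_bit_error(a, sigma):
    """Per-bit lower bound (1 - a / (2 sigma)) / 2 obtained through Pinsker, clipped at zero."""
    return max(0.0, 0.5 * (1.0 - a / (2.0 * sigma)))

m, sigma = 50, 1.0
for c in (0.25, 0.5, 1.0, 1.5):
    a = c * sigma
    # Expected Hamming error scale: exact per-bit error versus the Pinsker-based bound.
    print(c, m * exact_bit_error(a, sigma), m * pinsker_bit_error(a, sigma))
```

The exact per-bit error stays positive for every $a$, whereas the Pinsker-based bound vanishes at $a=2\sigma$; both are of order one when $a/\sigma$ is a small constant, which is all the example needs.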
The example gives a concrete support-recovery obstruction, but it is helpful to state the general coordinatewise consequence once. The theorem below isolates the support-indicator version of Assouad so it can be reused in variable-selection problems.
[quotetheorem:5910]
[citeproof:5910]
The uniform edge bound is the important hypothesis: if some coordinates are easy and others are hard, the sharper statement keeps the coordinate-dependent terms from Assouad's lemma. The probability bound is weaker than the expected [Hamming bound](/theorems/1639) because many support errors are collapsed into the single event $\hat S\neq S(v)$. This local lower bound is intentionally different from the sharp high-dimensional threshold for exact support recovery over all $d$ coordinates, where logarithmic factors such as $\sigma\sqrt{\log d}$ enter through multiple comparisons. The final message is methodological: use Fano when the construction is naturally a packing with many alternatives and the loss is global; use Assouad when the construction is naturally a hypercube, the loss decomposes across bits, and edge experiments can be bounded by total variation, Hellinger affinity, or KL through Pinsker's inequality.
Assouad's lemma complements Fano by turning local coordinate flips into lower bounds that decompose across bits. We now shift from abstract lower-bound constructions to sparse linear models, where these ideas determine the limits of estimation beyond what standard algorithms can achieve.
# 4. Sparse Linear Models Beyond Algorithmic Guarantees
Sparse linear models are usually introduced through algorithms: thresholding, basis pursuit, the Dantzig selector, and the Lasso. In this chapter the emphasis shifts from how to compute an estimator to what any estimator can achieve. The prerequisites are the Gaussian linear model, elementary minimax risk from Chapter 1, Kullback-Leibler divergence for Gaussian measures and Fano's inequality from Chapter 2, and the Lasso oracle bounds proved in the first high-dimensional statistics course. The same arguments also connect sparse regression to compressed sensing, random matrix theory, and multiple testing: all three ask how many alternatives can be separated through noisy linear measurements. The point is to separate statistical limits from algorithmic guarantees: restricted eigenvalue constants, compatibility constants, and beta-min conditions are not only proof devices for convex optimisation, but also encode the geometry of the experiment.
## Losses in the Gaussian Linear Model
The first question is what it means to estimate a sparse regression vector when the design itself mixes coordinates. In the Gaussian linear model, the same estimator can be judged by its prediction error, by its Euclidean error, or by exact recovery of the active variables, and these losses have different minimax behaviour.
[definition: Gaussian Linear Model]
Let $X \in \mathbb R^{n \times d}$ be a design matrix, let $\beta \in \mathbb R^d$, and let $\varepsilon \sim \mathcal N(0, \sigma^2 I_n)$. The Gaussian linear model is
\begin{align*}
y = X\beta + \varepsilon.
\end{align*}
The observation is $y \in \mathbb R^n$, the parameter is $\beta$, and $\sigma > 0$ is the noise level.
[/definition]
The model fixes the observation distribution once $X$ and $\beta$ are specified. To study high-dimensional regression rather than ordinary low-dimensional regression, the next ingredient is a parameter space that expresses the structural assumption that only a small number of coordinates matter.
[definition: Sparse Regression Class]
For $1 \le k \le d$, the $k$-sparse regression class is
\begin{align*}
\Theta_k = \{\beta \in \mathbb R^d : |\{j : \beta_j \ne 0\}| \le k\}.
\end{align*}
For $S \subset \{1,\dots,d\}$, write $\beta_S$ for the restriction of $\beta$ to coordinates in $S$, and write $S(\beta)=\{j:\beta_j\ne 0\}$.
[/definition]
Sparsity specifies which vectors are allowed, but a minimax statement also needs a loss. The following three losses measure distinct tasks: predicting future linear responses, estimating the coefficient vector, and identifying the support.
[definition: Prediction, Estimation, and Support Losses]
For a fixed design matrix $X \in \mathbb R^{n \times d}$, the fixed-design prediction loss is the map $L_{\mathrm{pred}}(\cdot,\cdot;X):\mathbb R^d \times \mathbb R^d \to [0,\infty)$ defined by
\begin{align*}
L_{\mathrm{pred}}(\hat\beta,\beta;X)=\frac{1}{n}|X(\hat\beta-\beta)|^2.
\end{align*}
The Euclidean estimation loss is the map $L_2:\mathbb R^d \times \mathbb R^d \to [0,\infty)$ defined by
\begin{align*}
L_2(\hat\beta,\beta)=|\hat\beta-\beta|^2.
\end{align*}
The exact support recovery loss is the map $L_{\mathrm{supp}}:\mathbb R^d \times \mathbb R^d \to \{0,1\}$ defined by
\begin{align*}
L_{\mathrm{supp}}(\hat\beta,\beta)=\mathbb{1}_{\{S(\hat\beta)\ne S(\beta)\}}.
\end{align*}
[/definition]
Prediction is weaker than Euclidean estimation when $X$ has a nontrivial kernel, because two different vectors can induce the same fitted values. Support recovery is stronger in a different direction: it asks for a discrete object and is sensitive to small nonzero coefficients.
[example: Ordinary Least Squares Versus Sparse Prediction]
Assume $d\le n$ and $X^\top X=nI_d$. The ordinary least-squares estimator is
\begin{align*}
\hat\beta_{\mathrm{ols}}=(X^\top X)^{-1}X^\top y.
\end{align*}
Using $y=X\beta+\varepsilon$, this becomes
\begin{align*}
\hat\beta_{\mathrm{ols}}=(X^\top X)^{-1}X^\top X\beta+(X^\top X)^{-1}X^\top\varepsilon.
\end{align*}
Since $(X^\top X)^{-1}X^\top X=I_d$, we have
\begin{align*}
\hat\beta_{\mathrm{ols}}-\beta=(X^\top X)^{-1}X^\top\varepsilon.
\end{align*}
Because $\operatorname{Cov}(\varepsilon)=\sigma^2I_n$,
\begin{align*}
\operatorname{Cov}(\hat\beta_{\mathrm{ols}})=\sigma^2(X^\top X)^{-1}X^\top X(X^\top X)^{-1}.
\end{align*}
Multiplying the middle factors gives
\begin{align*}
\operatorname{Cov}(\hat\beta_{\mathrm{ols}})=\sigma^2(X^\top X)^{-1}.
\end{align*}
Since $X^\top X=nI_d$, this is
\begin{align*}
\operatorname{Cov}(\hat\beta_{\mathrm{ols}})=\sigma^2 n^{-1}I_d.
\end{align*}
The fixed-design prediction risk is
\begin{align*}
\mathbb E_\beta\left[\frac{1}{n}|X(\hat\beta_{\mathrm{ols}}-\beta)|^2\right]=\frac{1}{n}\mathbb E_\beta\left[(\hat\beta_{\mathrm{ols}}-\beta)^\top X^\top X(\hat\beta_{\mathrm{ols}}-\beta)\right].
\end{align*}
Substituting $X^\top X=nI_d$ gives
\begin{align*}
\mathbb E_\beta\left[\frac{1}{n}|X(\hat\beta_{\mathrm{ols}}-\beta)|^2\right]=\mathbb E_\beta\left[|\hat\beta_{\mathrm{ols}}-\beta|^2\right].
\end{align*}
The estimator is unbiased, so this mean squared error is the trace of its covariance:
\begin{align*}
\mathbb E_\beta\left[|\hat\beta_{\mathrm{ols}}-\beta|^2\right]=\operatorname{tr}(\sigma^2 n^{-1}I_d)=\frac{\sigma^2 d}{n}.
\end{align*}
If the true vector is $k$-sparse with $k\ll d$, then estimating on a known support would replace the ambient dimension $d$ by $k$, while not knowing the support adds the search factor $\log(d/k)$. The sparse minimax prediction scale is therefore
\begin{align*}
\sigma^2\frac{k\log(d/k)}{n},
\end{align*}
which is smaller than the ordinary least-squares risk $\sigma^2 d/n$ when $k\log(d/k)\ll d$.
[/example]
This example shows the basic statistical saving from sparsity. The logarithmic factor is the price of not knowing the support, while the factor $k/n$ is the parametric rate after the support is known.
## Restricted Eigenvalues as Statistical Geometry
The next question is why prediction bounds can be stated under weaker design assumptions than coefficient estimation bounds. Prediction only sees $X\delta$, while Euclidean estimation asks whether the design separates different sparse coefficient vectors in $\mathbb R^d$.
[definition: Restricted Eigenvalue]
For $m \in \mathbb N$, the $m$-sparse restricted eigenvalue lower constant is the map $\kappa_-(m;\cdot):\mathbb R^{n\times d} \to [0,\infty)$ defined by
\begin{align*}
\kappa_-(m;X)=\inf\left\{\frac{|X\delta|}{\sqrt n|\delta|}: \delta \in \mathbb R^d,\ 0<|S(\delta)|\le m\right\}.
\end{align*}
The corresponding upper constant is the map $\kappa_+(m;\cdot):\mathbb R^{n\times d} \to [0,\infty]$ defined by
\begin{align*}
\kappa_+(m;X)=\sup\left\{\frac{|X\delta|}{\sqrt n|\delta|}: \delta \in \mathbb R^d,\ 0<|S(\delta)|\le m\right\}.
\end{align*}
[/definition]
These constants say whether sparse directions survive the [linear map](/page/Linear%20Map) $\delta \mapsto X\delta/\sqrt n$. Lasso analyses often need a more asymmetric quantity, because the error vector is not assumed to be sparse but is constrained by the geometry of the $\ell^1$ penalty.
[definition: Compatibility Constant]
Let $S \subset \{1,\dots,d\}$ with $|S|=s$. The compatibility constant on $S$ is the map $\phi^2(S;\cdot):\mathbb R^{n\times d} \to [0,\infty]$ defined by
\begin{align*}
\phi^2(S;X)=\inf\left\{\frac{s|X\delta|^2}{n|\delta_S|_1^2}: \delta \in \mathbb R^d,\ |\delta_{S^c}|_1\le 3|\delta_S|_1,\ \delta_S\ne 0\right\}.
\end{align*}
[/definition]
The compatibility constant is tailored to the cone generated by $\ell^1$ regularisation. In this chapter it is used only as geometry of the model, not as a step in proving a Lasso oracle inequality.
[remark: Algorithmic Conditions and Statistical Conditions]
Restricted eigenvalue and compatibility assumptions often appear in algorithmic upper-bound proofs for the Lasso. The same constants also determine whether sparse alternatives are well separated in Kullback-Leibler divergence and in prediction distance. A failure of these constants is therefore not merely a failure of a proof technique; it may mean that the experiment cannot distinguish some sparse coefficients at the requested loss scale.
[/remark]
The remaining issue is whether these constants are reasonable in the canonical random design model. A concrete obstruction is duplicated columns: if $X_1=X_2$, then $e_1-e_2$ is a $2$-sparse direction killed by $X$, so $\kappa_-(2;X)=0$. For Gaussian designs, such exact degeneracy has probability zero, and the stronger fact needed here is uniform singular-value control over all sparse coordinate subspaces.
[quotetheorem:5911]
[citeproof:5911]
This theorem is the bridge between random matrices and minimax sparse regression. The lower bound on $n$ reflects an actual dimensional obstruction as well as a proof method: for a fixed $m$-coordinate subspace, the map $X_S:\mathbb R^m\to\mathbb R^n$ cannot be injective when $n<m$, and uniform control over all $\binom{d}{m}$ coordinate subspaces requires enough samples to pay for their metric entropy. Sharp compressed-sensing lower bounds show that an order $m\log(ed/m)$ sample size is necessary, up to constants, for Gaussian matrices to preserve all $m$-sparse Euclidean norms with fixed distortion. The theorem does not say that every fixed design is well behaved, as duplicated or highly collinear columns can force $\kappa_-(m;X)=0$. It connects forward by giving the norm equivalence that lets prediction lower bounds become Euclidean estimation lower bounds for Gaussian designs.
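For $m=2$ the restricted eigenvalue constants can be computed exactly on small matrices, because every $2$-sparse direction lives in some pair of columns and the extreme eigenvalues of a $2\times 2$ Gram block have a closed form. The Python sketch below is a toy illustration under those assumptions (it requires $d\ge 2$); the helper names and the two example matrices are hypothetical.

```python
import math
from itertools import combinations

def gram_2x2(X, i, j):
    """Entries of the scaled Gram block X_S^T X_S / n for the column pair S = {i, j}."""
    n = len(X)
    g_ii = sum(row[i] * row[i] for row in X) / n
    g_jj = sum(row[j] * row[j] for row in X) / n
    g_ij = sum(row[i] * row[j] for row in X) / n
    return g_ii, g_jj, g_ij

def kappa_pair(X):
    """Return (kappa_-(2; X), kappa_+(2; X)) by scanning all column pairs (needs d >= 2)."""
    d = len(X[0])
    lo, hi = math.inf, 0.0
    for i, j in combinations(range(d), 2):
        g_ii, g_jj, g_ij = gram_2x2(X, i, j)
        mean = 0.5 * (g_ii + g_jj)
        radius = math.sqrt((0.5 * (g_ii - g_jj)) ** 2 + g_ij ** 2)
        lo = min(lo, max(0.0, mean - radius))  # smallest eigenvalue, clipped against rounding
        hi = max(hi, mean + radius)            # largest eigenvalue
    return math.sqrt(lo), math.sqrt(hi)

# n = 4 rows; columns scaled to length sqrt(n) = 2.
orthogonal = [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 2.0], [0.0, 0.0, 0.0]]
duplicated = [[2.0, 2.0, 0.0], [0.0, 0.0, 2.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
print(kappa_pair(orthogonal), kappa_pair(duplicated))
```

In the second matrix the first two columns coincide, so the lower constant collapses to zero while the upper constant grows; this is the duplicated-column obstruction in its smallest instance.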
## Minimax Prediction Risk
We now ask for the best possible prediction accuracy over all $k$-sparse vectors. The result matches the rate achieved by Lasso-type estimators, but the lower bound does not rely on redoing the Lasso analysis.
[quotetheorem:5912]
[citeproof:5912]
The theorem is deliberately phrased with a two-sided sparse eigenvalue assumption because the fixed-design lower bound is false without design separation. If many columns are duplicated, the number of distinguishable fitted means is much smaller than the number of sparse supports, and the support-search term $k\log(d/k)$ need not appear in prediction risk. The theorem also does not claim exact constants or adaptation to unknown $k$; it identifies the statistical order once the design has enough sparse geometry. This prepares the Euclidean result below, where the same $2k$-sparse norm equivalence is used in the opposite direction to compare prediction and coefficient error.
[example: Sparse Prediction Beats Full Least Squares]
Take an orthogonal fixed design with $d\le n$ and $X^\top X=nI_d$. In this case ordinary least squares has prediction risk $\sigma^2 d/n$, as computed in the preceding ordinary least-squares example. For the numerical values $d=10^4$ and $n=2\times 10^4$, this gives
\begin{align*}
\sigma^2\frac{d}{n}=\sigma^2\frac{10^4}{2\times 10^4}=0.5\sigma^2.
\end{align*}
The sparse prediction rate from *Minimax Sparse Prediction Rate* has scale
\begin{align*}
\sigma^2\frac{k\log(d/k)}{n}.
\end{align*}
Substituting $k=50$, $d=10^4$, and $n=2\times 10^4$ gives
\begin{align*}
\sigma^2\frac{k\log(d/k)}{n}=\sigma^2\frac{50\log(10^4/50)}{2\times 10^4}.
\end{align*}
Since $10^4/50=200$, this becomes
\begin{align*}
\sigma^2\frac{50\log(200)}{2\times 10^4}=\sigma^2\frac{\log(200)}{400}.
\end{align*}
Using the natural logarithm, $\log(200)\approx 5.30$, so the sparse scale is approximately
\begin{align*}
\sigma^2\frac{5.30}{400}=0.01325\sigma^2.
\end{align*}
The ratio between the full least-squares scale and the sparse scale is
\begin{align*}
\frac{\sigma^2 d/n}{\sigma^2 k\log(d/k)/n}=\frac{d}{k\log(d/k)}.
\end{align*}
For the displayed values,
\begin{align*}
\frac{d}{k\log(d/k)}=\frac{10^4}{50\log(200)}=\frac{200}{\log(200)}\approx 37.7.
\end{align*}
Thus, in this regime, an estimator that exploits sparsity has a prediction scale about $38$ times smaller than ordinary least squares.
[/example]
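The comparison in this example reduces to the ratio $d/(k\log(d/k))$, in which $n$ and $\sigma$ cancel. The Python sketch below tabulates that ratio for a few sparsity levels; the function names and the chosen values of $k$ are illustrative assumptions, and the formulas are the two scales written above rather than exact minimax constants.

```python
import math

def ols_scale(d, n, sigma=1.0):
    """Least-squares prediction risk sigma^2 * d / n under X^T X = n I_d."""
    return sigma ** 2 * d / n

def sparse_scale(k, d, n, sigma=1.0):
    """Sparse prediction scale sigma^2 * k * log(d / k) / n (natural logarithm)."""
    return sigma ** 2 * k * math.log(d / k) / n

def gain(k, d):
    """Ratio ols_scale / sparse_scale = d / (k log(d / k)); n and sigma cancel."""
    return d / (k * math.log(d / k))

d = 10 ** 4
for k in (10, 50, 200, 1000):
    print(k, gain(k, d))
```

The gain shrinks as $k$ grows toward $d$, which matches the condition $k\log(d/k)\ll d$ under which sparsity helps.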
This comparison also explains why prediction can remain meaningful when $d>n$. Ordinary least squares is no longer defined without additional choices, while the sparse prediction problem still has a finite statistical rate when $k\log(d/k)\ll n$.
## Euclidean Estimation Under Normalized Gaussian Design
Prediction risk does not by itself control $|\hat\beta-\beta|^2$ unless the design is well conditioned on sparse differences. The next problem is to translate prediction lower and upper bounds into coefficient estimation bounds for random Gaussian designs.
[definition: Normalized Gaussian Design]
A normalized Gaussian design is a random matrix $X\in\mathbb R^{n\times d}$ whose entries are independent $\mathcal N(0,1)$ random variables, considered through the scaled Gram matrix $X^\top X/n$.
[/definition]
This normalisation makes the columns have length about $\sqrt n$ and puts the prediction scale on the same footing as Euclidean coefficient scale. The next theorem is needed to convert the sparse prediction minimax rate into a coefficient-estimation minimax rate using the $2k$-sparse restricted eigenvalue event.
[quotetheorem:5913]
[citeproof:5913]
This result is the sparse analogue of the classical $d/n$ rate in a $d$-dimensional Gaussian linear model. The sample-size condition is needed to ensure that $2k$-sparse differences are not hidden in the kernel of $X$; when $d>n$ and no restricted eigenvalue condition is available, Euclidean estimation can fail even though prediction is still meaningful. The theorem does not give support recovery, because squared Euclidean error may be small while a coefficient just above zero is missed. The dimension $d$ is replaced by the effective model complexity $k\log(ed/k)$, which combines estimation on a support with the cost of locating that support and leads naturally to the stronger beta-min question.
[example: Why Restricted Eigenvalues Are Needed]
Assume the first two columns of $X$ are identical, so $Xe_1=Xe_2$. For $\beta=a e_1$ and $\beta'=a e_2$, both vectors are $1$-sparse, and
\begin{align*}
X\beta=X(ae_1)=aXe_1.
\end{align*}
Similarly,
\begin{align*}
X\beta'=X(ae_2)=aXe_2=aXe_1.
\end{align*}
Therefore $X\beta=X\beta'$, so the Gaussian linear model gives the same observation law under $\beta$ and $\beta'$:
\begin{align*}
\mathcal N(X\beta,\sigma^2I_n)=\mathcal N(X\beta',\sigma^2I_n).
\end{align*}
Their Euclidean separation is nonzero:
\begin{align*}
|\beta-\beta'|^2=|ae_1-ae_2|^2.
\end{align*}
Factoring out $a$ gives
\begin{align*}
|ae_1-ae_2|^2=a^2|e_1-e_2|^2.
\end{align*}
Since $|e_1|^2=1$, $|e_2|^2=1$, and $e_1^\top e_2=0$,
\begin{align*}
|e_1-e_2|^2=|e_1|^2+|e_2|^2-2e_1^\top e_2=1+1-0=2.
\end{align*}
Thus
\begin{align*}
|\beta-\beta'|^2=2a^2,
\end{align*}
and hence $|\beta-\beta'|=\sqrt{2}|a|$.
For any estimator $\hat\beta(y)$ and any realized value of $y$, the triangle inequality gives
\begin{align*}
|\beta-\beta'|\le |\hat\beta(y)-\beta|+|\hat\beta(y)-\beta'|.
\end{align*}
Using $(u+v)^2\le 2u^2+2v^2$ for $u,v\ge 0$,
\begin{align*}
|\beta-\beta'|^2\le 2|\hat\beta(y)-\beta|^2+2|\hat\beta(y)-\beta'|^2.
\end{align*}
Since $|\beta-\beta'|^2=2a^2$, this implies
\begin{align*}
|\hat\beta(y)-\beta|^2+|\hat\beta(y)-\beta'|^2\ge a^2.
\end{align*}
Taking expectation under the common distribution of $y$ gives
\begin{align*}
\mathbb E_\beta|\hat\beta-\beta|^2+\mathbb E_{\beta'}|\hat\beta-\beta'|^2\ge a^2.
\end{align*}
Therefore
\begin{align*}
\max\{\mathbb E_\beta|\hat\beta-\beta|^2,\mathbb E_{\beta'}|\hat\beta-\beta'|^2\}\ge \frac{a^2}{2}.
\end{align*}
As $a$ can be arbitrarily large, no estimator can have uniformly bounded Euclidean risk over this sparse class, even though prediction sees no difference between the two coefficient vectors.
[/example]
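The two-point argument in the example can be checked numerically. The following is a minimal sketch, assuming a small hypothetical design `X` whose first two columns coincide; the helper `matvec` and the value of `a` are illustrative choices, not part of the course material.

```python
def matvec(X, b):
    """Multiply a matrix stored as a list of rows by a vector."""
    return [sum(x_ij * b_j for x_ij, b_j in zip(row, b)) for row in X]

# Hypothetical 3 x 3 design whose first two columns are identical.
X = [
    [1.0, 1.0, 0.5],
    [2.0, 2.0, -1.0],
    [-1.0, -1.0, 3.0],
]

a = 10.0
beta = [a, 0.0, 0.0]        # a * e_1
beta_prime = [0.0, a, 0.0]  # a * e_2

# Both coefficient vectors produce the same mean vector X beta.
mean_beta = matvec(X, beta)
mean_beta_prime = matvec(X, beta_prime)

# Squared Euclidean separation |beta - beta'|^2, which equals 2 a^2.
sep_sq = sum((u - v) ** 2 for u, v in zip(beta, beta_prime))

# Two-point lower bound on the worst-case squared risk: a^2 / 2.
risk_lower_bound = sep_sq / 4.0

print(mean_beta == mean_beta_prime)
print(sep_sq, risk_lower_bound)
```

Increasing `a` leaves the mean vectors identical while the lower bound grows like $a^2/2$, which is the unbounded-risk phenomenon described in the example.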
The example isolates the role of design geometry. Without separation of sparse columns, coefficient estimation and support recovery are statistically ill-posed even when prediction remains possible.
## Support Recovery and Signal Strength
The final question is when the exact set of active variables can be recovered. Unlike prediction and Euclidean estimation, support recovery is impossible unless every nonzero coefficient is large enough to overcome the noise and the multiplicity of $d$ possible coordinates.
[definition: Beta-Min Sparse Class]
For $a>0$, the beta-min sparse class is
\begin{align*}
\Theta_k(a)=\{\beta\in\Theta_k: S(\beta)\ne\varnothing,\ \min_{j\in S(\beta)} |\beta_j|\ge a\}.
\end{align*}
[/definition]
The parameter $a$ is the minimal signal amplitude. If it is too small, the support problem contains many nearly indistinguishable testing problems, one for each coordinate that may or may not be active.
[quotetheorem:5914]
[citeproof:5914]
The theorem states only a necessary condition, but it captures the right scale for familiar procedures under incoherent or restricted-eigenvalue designs. The restricted-eigenvalue hypothesis is needed to translate coefficient amplitude into distinguishable mean vectors; if a column is nearly a copy of another, the active coordinate can be statistically ambiguous even when $a$ is not small. The theorem does not assert that the displayed beta-min condition alone is sufficient, since sufficiency also depends on irrepresentability, incoherence, or a comparable condition for the chosen procedure. The support problem is harder than estimating the vector in average squared error because a single missed small coefficient causes failure, so the chapter ends with a genuinely coordinatewise requirement rather than another global risk bound.
[example: The Beta-Min Scale]
Suppose $\sigma=1$, $d=10^5$, and $n=1000$. By *Necessary Signal Strength for Support Recovery*, the beta-min scale is proportional to
\begin{align*}
\sigma\sqrt{\frac{\log d}{n}}.
\end{align*}
Substituting the displayed values gives
\begin{align*}
\sigma\sqrt{\frac{\log d}{n}}=1\cdot \sqrt{\frac{\log(10^5)}{1000}}.
\end{align*}
Since $10^5$ is a power of $10$, the logarithm satisfies
\begin{align*}
\log(10^5)=5\log(10).
\end{align*}
Using the natural logarithm, $\log(10)\approx 2.302585$, so
\begin{align*}
5\log(10)\approx 5\cdot 2.302585=11.512925.
\end{align*}
Therefore
\begin{align*}
\sqrt{\frac{\log(10^5)}{1000}}\approx \sqrt{\frac{11.512925}{1000}}.
\end{align*}
Dividing by $1000$ gives
\begin{align*}
\frac{11.512925}{1000}=0.011512925.
\end{align*}
Taking the square root gives
\begin{align*}
\sqrt{0.011512925}\approx 0.1073.
\end{align*}
Thus the necessary beta-min scale is about $0.11$, up to the constants in the lower bound. Coefficients substantially below this level cannot be uniformly separated from noise across $10^5$ possible variables, even if the fitted mean can still be predicted accurately.
[/example]
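The arithmetic in this example is easy to reproduce. The sketch below assumes natural logarithms and ignores the unspecified constant in the lower bound; the function name `beta_min_scale` is a hypothetical label for the quantity $\sigma\sqrt{\log d/n}$.

```python
import math

def beta_min_scale(sigma, d, n):
    """Return sigma * sqrt(log(d) / n), the beta-min scale up to constants."""
    return sigma * math.sqrt(math.log(d) / n)

# Values from the example: sigma = 1, d = 10^5, n = 1000.
scale = beta_min_scale(sigma=1.0, d=10**5, n=1000)
print(round(scale, 4))  # 0.1073

# The scale depends on d only through sqrt(log d), so it grows slowly with d.
for d in (10**3, 10**5, 10**7):
    print(d, beta_min_scale(1.0, d, 1000))
```

The loop illustrates that multiplying the dimension by $100$ changes the scale only through $\sqrt{\log d}$, while the dependence on $n$ is the parametric $n^{-1/2}$.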
This last example summarises the chapter's hierarchy of tasks. Prediction asks for the fitted mean and has rate
\begin{align*}
\sigma^2\frac{k\log(d/k)}{n};
\end{align*}
Euclidean estimation has the same order when the design satisfies the $2k$-sparse restricted eigenvalue condition; exact support recovery additionally requires a coordinatewise signal strength of order
\begin{align*}
\sigma\sqrt{\frac{\log d}{n}}.
\end{align*}
Sparse linear models provided the first major setting where minimax rates depend on sparsity, dimension, and noise in a precise way. The next chapter recasts sparse recovery as a compressed sensing problem, where the same statistical questions are expressed through linear measurement operators and geometric recovery conditions.
# 5. Compressed Sensing as a Statistical Experiment
Compressed sensing asks how many linear measurements are needed to recover a high-dimensional vector when the vector has a sparse representation. Chapters 1 and 2 used packing and testing to identify minimax barriers; this chapter applies the same logic to linear observation schemes, while Chapter 4's sparse-regression geometry reappears as restricted isometry. The main statistical experiment is a sensing matrix followed by either exact reconstruction or noisy estimation. The prerequisites are basic linear algebra, Euclidean concentration inequalities, packing lower bounds, and the minimax testing ideas developed earlier in the course.
## Sparse Recovery from Linear Measurements
The first question is whether a vector in $\mathbb R^d$ can be identified from $n$ linear measurements when $n \ll d$. Without structure this is impossible because the measurement map has a nonzero null space. Sparsity changes the problem by replacing all of $\mathbb R^d$ with a union of low-dimensional coordinate subspaces.
[definition: Sparse Vector]
Let $d \in \mathbb N$ and $1 \le k \le d$. A vector $x \in \mathbb R^d$ is $k$-sparse if
\begin{align*}
|\{j \in \{1,\dots,d\}: x_j \ne 0\}| \le k.
\end{align*}
The set of all $k$-sparse vectors in $\mathbb R^d$ is denoted $\Sigma_k$.
[/definition]
The set $\Sigma_k$ has dimension $k$ after the support is fixed, but the support is part of the unknown. The next example shows why a measurement scheme must see every sparse coordinate pattern.
[example: Coordinate Projection Loses Sparse Signals]
Let $A:\mathbb R^d\to\mathbb R^n$ be the coordinate projection $Ax=(x_1,\dots,x_n)$, where $n<d$. Let $e_{n+1}$ be the standard basis vector whose $(n+1)$-st coordinate is $1$ and whose other coordinates are $0$. Its support is $\{n+1\}$, so $|\operatorname{supp}(e_{n+1})|=1$ and therefore $e_{n+1}\in\Sigma_1$.
Applying $A$ keeps only the first $n$ coordinates. For every $1\le j\le n$, the $j$-th coordinate of $e_{n+1}$ is $0$, so
\begin{align*}
Ae_{n+1}=((e_{n+1})_1,\dots,(e_{n+1})_n)=(0,\dots,0).
\end{align*}
Also
\begin{align*}
A0=(0,\dots,0).
\end{align*}
Since $e_{n+1}\ne0$, the two distinct $1$-sparse vectors $0$ and $e_{n+1}$ have the same measurements. Thus this projection cannot recover all $1$-sparse vectors; the measured coordinates miss an entire sparse direction, so the geometry of the rows of $A$ matters, not only the number of rows.
[/example]
This failure motivates treating the measurement matrix as part of the statistical design. To compare exact and noisy observations, we next name the compressed sensing experiment.
[definition: Compressed Sensing Experiment]
Let $A \in \mathbb R^{n \times d}$ be a measurement matrix. In the noiseless compressed sensing experiment, one observes
\begin{align*}
y=Ax,
\end{align*}
where $x\in\Sigma_k$. In the Gaussian noisy compressed sensing experiment, one observes
\begin{align*}
y=Ax+\varepsilon, \qquad \varepsilon\sim\mathcal N(0,\sigma^2 I_n),
\end{align*}
where $x\in\Sigma_k$ and $\sigma>0$.
[/definition]
The ideal decoder searches over $\Sigma_k$, but that is a nonconvex procedure over unknown supports. The practical obstruction is that exact sparsity is combinatorial, while the measurements only impose linear constraints. A convex relaxation must therefore encourage sparse solutions without explicitly enumerating supports, and the $\ell^1$ norm is the standard penalty used for this purpose.
[definition: Basis Pursuit]
Let $A \in \mathbb R^{n \times d}$ and $y\in\mathbb R^n$. A basis pursuit solution is any solution of
\begin{align*}
\hat{x}\in\operatorname*{argmin}_{z\in\mathbb R^d}\|z\|_1
\quad\text{subject to}\quad Az=y.
\end{align*}
[/definition]
Basis pursuit exploits the coordinate-aligned corners of the $\ell^1$ ball. To know when this relaxation preserves the sparse solution, we need to control the directions that can be added without changing the measurements.
[definition: Null-Space Property]
Let $A\in\mathbb R^{n\times d}$ and $1\le k\le d$. The matrix $A$ satisfies the null-space property of order $k$ if, for every $h\in\ker(A)\setminus\{0\}$ and every $S\subset\{1,\dots,d\}$ with $|S|\le k$,
\begin{align*}
\|h_S\|_1<\|h_{S^c}\|_1.
\end{align*}
[/definition]
The null-space property says that no invisible perturbation is mostly concentrated on a small support. This is exactly the geometric obstruction to $\ell^1$ minimization.
The issue is that basis pursuit only sees the affine set $\{z:Az=Ax\}$, so any nonzero vector in $\ker(A)$ gives a competing feasible point $x+h$. Exact recovery can hold for every $k$-sparse $x$ only if every such invisible perturbation increases the $\ell^1$ norm away from the sparse support. The following result identifies that obstruction precisely.
[quotetheorem:5915]
[citeproof:5915]
The strict inequality in the null-space property is essential: if equality were allowed for some $h$ and $S$, then two feasible points could have the same $\ell^1$ norm and basis pursuit would not single out the sparse vector. The theorem is deterministic and says nothing about how to check the property efficiently for a given large matrix. Its role is to turn algorithmic exact recovery into a geometric condition on $\ker(A)$.
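For a single candidate null vector $h$, the inequality in the null-space property is simple to test, because the worst set $S$ collects the $k$ largest entries of $h$ in absolute value. The sketch below checks only one given vector; it does not verify the property for a matrix, which would require all of $\ker(A)$. The function name `nsp_holds_for_vector` is hypothetical.

```python
def nsp_holds_for_vector(h, k):
    """Check ||h_S||_1 < ||h_{S^c}||_1 for every S with |S| <= k.

    For fixed h the left side is largest, and the right side smallest,
    when S holds the k largest |h_j|, so testing that one set suffices.
    """
    mags = sorted((abs(v) for v in h), reverse=True)
    head = sum(mags[:k])   # l1 mass on the k largest coordinates
    tail = sum(mags[k:])   # l1 mass on the remaining coordinates
    return head < tail

# h = e_1 - e_2 lies in ker(A) whenever the first two columns of A coincide.
print(nsp_holds_for_vector([1.0, -1.0, 0.0, 0.0], k=1))  # False: equality, not strict

# A spread-out vector passes at order 1 but fails at order 2.
print(nsp_holds_for_vector([1.0, 1.0, 1.0, 1.0], k=1))   # True
print(nsp_holds_for_vector([1.0, 1.0, 1.0, 1.0], k=2))   # False
```

The first call shows the equality case discussed above: the vectors $e_1$ and $e_2$ are both feasible with the same $\ell^1$ norm, so basis pursuit cannot single out either one.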
The characterisation is exact, but it is hard to verify directly because it quantifies over all null vectors. We therefore introduce a stronger metric condition that is easier to prove for random matrices and also gives stability.
[definition: Restricted Isometry Property]
Let $A\in\mathbb R^{n\times d}$ and $1\le k\le d$. The restricted isometry constant $\delta_k(A)$ is the infimum of all $\delta\ge0$ such that
\begin{align*}
(1-\delta)|x|^2\le |Ax|^2\le(1+\delta)|x|^2
\end{align*}
for every $x\in\Sigma_k$.
[/definition]
RIP says that every sparse coordinate subspace is embedded almost isometrically.
The first obstruction to any recovery method is collision: two different $k$-sparse vectors might produce the same measurements. Such a collision has difference supported on at most $2k$ coordinates, so the relevant question is whether $A$ can vanish on a nonzero $2k$-sparse vector. The following result records the resulting identifiability criterion.
[quotetheorem:5916]
[citeproof:5916]
The hypothesis $\delta_{2k}(A)<1$ is exactly tied to comparing two $k$-sparse vectors, since their difference may have support size $2k$. A condition only at order $k$ would miss collisions between distinct supports. This theorem gives identifiability, not a stable inverse and not an efficient decoder.
Injectivity identifies the right sparse vector, but it does not say that basis pursuit finds it.
The remaining obstruction is algorithmic rather than set-theoretic: even when no two sparse vectors collide, the $\ell^1$ minimizer could still move along a nearly sparse null direction and lower the objective. To rule this out, one needs a quantitative sparse near-isometry strong enough to force the null-space property. The following result gives one standard RIP route from identifiability to basis pursuit recovery.
[quotetheorem:5917]
[citeproof:5917]
The smallness of the RIP constant is stronger than injectivity: matrices with $\delta_{2k}(A)$ close to $1$ may distinguish sparse vectors but still have nearly flat sparse directions, which makes $\ell^1$ recovery unstable. The theorem also does not claim that the displayed threshold is sharp; different proofs give different numerical constants. Its main consequence is qualitative and robust: near-isometry on sparse vectors forces the convex relaxation to agree with the sparse search.
This theorem explains why random matrices are useful: it is enough to show that they preserve all sparse directions. The next example gives the Gaussian model and the entropy calculation behind the sample size.
[example: Gaussian Sensing Matrix]
Let $A_{ij}\sim\mathcal N(0,1/n)$ independently. Fix a support $S\subset\{1,\dots,d\}$ with $|S|=k$, and fix $x\in\mathbb R^d$ supported on $S$ with $|x|=1$. For each row $i$,
\begin{align*}
(Ax)_i=\sum_{j\in S}A_{ij}x_j.
\end{align*}
This is a centered Gaussian [random variable](/page/Random%20Variable) because it is a linear combination of independent centered Gaussian variables. Its variance is
\begin{align*}
\operatorname{Var}((Ax)_i)=\sum_{j\in S}x_j^2\operatorname{Var}(A_{ij})=\sum_{j\in S}x_j^2\frac1n=\frac{|x|^2}{n}=\frac1n.
\end{align*}
The rows of $A$ are independent, so $(Ax)_1,\dots,(Ax)_n$ are independent $\mathcal N(0,1/n)$ variables. Equivalently, if $g_1,\dots,g_n$ are independent $\mathcal N(0,1)$ variables, then
\begin{align*}
|Ax|^2=\sum_{i=1}^n(Ax)_i^2\stackrel{d}{=}\sum_{i=1}^n\frac{g_i^2}{n}=\frac{\chi_n^2}{n}.
\end{align*}
For a fixed unit vector, chi-square concentration gives a universal constant $c>0$ such that, for $0<\delta<1$,
\begin{align*}
\mathbb P\left(\left||Ax|^2-1\right|>\delta\right)\le 2\exp(-c\delta^2 n).
\end{align*}
To make this uniform on one support $S$, take an $\eta$-net $N_S$ of the unit sphere in $\mathbb R^S$ with
\begin{align*}
|N_S|\le \left(\frac3\eta\right)^k.
\end{align*}
A union bound over the net gives
\begin{align*}
\mathbb P\left(\exists u\in N_S:\left||Au|^2-1\right|>\delta/2\right)\le 2\left(\frac3\eta\right)^k\exp(-c\delta^2 n).
\end{align*}
Since $\left(\frac3\eta\right)^k=\exp(k\log(3/\eta))$, this upper bound is
\begin{align*}
2\exp\left(k\log(3/\eta)-c\delta^2 n\right).
\end{align*}
Choosing $\eta$ as a sufficiently small constant multiple of $\delta$ and using the standard net approximation step converts control on $N_S$ into
\begin{align*}
(1-\delta)|x|^2\le |Ax|^2\le(1+\delta)|x|^2
\end{align*}
for every $x$ supported on $S$.
There are
\begin{align*}
\binom dk\le \left(\frac{ed}{k}\right)^k
\end{align*}
supports of size $k$. A second union bound gives total failure probability at most
\begin{align*}
2\left(\frac{ed}{k}\right)^k\left(\frac3\eta\right)^k\exp(-c\delta^2 n).
\end{align*}
Equivalently, this is
\begin{align*}
2\exp\left(k\log(ed/k)+k\log(3/\eta)-c\delta^2 n\right).
\end{align*}
Thus the failure probability is small once
\begin{align*}
n\ge C\delta^{-2}k\log(ed/k),
\end{align*}
with $C$ large enough to absorb the fixed net factor. Hence, with high probability, a Gaussian matrix with entries $\mathcal N(0,1/n)$ preserves the squared norm of every $k$-sparse vector up to relative error $\delta$ at the sample size $n\asymp k\log(d/k)$ for fixed $\delta$; this is the Johnson-Lindenstrauss net-and-union-bound mechanism applied to the sparse unit sphere.
[/example]
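The effect of the sample size on sparse distortion can be explored by simulation. The following sketch draws one Gaussian matrix with $\mathcal N(0,1/n)$ entries and records the largest value of $\big||Ax|^2-1\big|$ over randomly sampled $k$-sparse unit vectors. Because only finitely many sparse vectors are sampled, the result is a lower bound on $\delta_k(A)$, not the constant itself. All function names, the seed, and the parameter values are illustrative assumptions.

```python
import math
import random

def gaussian_sensing_matrix(n, d, rng):
    """Draw an n x d matrix with independent N(0, 1/n) entries."""
    s = 1.0 / math.sqrt(n)
    return [[rng.gauss(0.0, s) for _ in range(d)] for _ in range(n)]

def random_sparse_unit_vector(d, k, rng):
    """Return a k-sparse unit vector as a dict {coordinate: value}."""
    support = rng.sample(range(d), k)
    coeffs = [rng.gauss(0.0, 1.0) for _ in range(k)]
    norm = math.sqrt(sum(c * c for c in coeffs))
    return {j: c / norm for j, c in zip(support, coeffs)}

def squared_norm_of_image(A, x_sparse):
    """Compute |Ax|^2 using only the nonzero coordinates of x."""
    return sum(sum(row[j] * v for j, v in x_sparse.items()) ** 2 for row in A)

def empirical_distortion(n, d, k, trials, rng):
    """Largest | |Ax|^2 - 1 | over sampled sparse unit vectors, for one draw of A."""
    A = gaussian_sensing_matrix(n, d, rng)
    return max(
        abs(squared_norm_of_image(A, random_sparse_unit_vector(d, k, rng)) - 1.0)
        for _ in range(trials)
    )

rng = random.Random(0)
few_rows = empirical_distortion(n=20, d=400, k=5, trials=300, rng=rng)
many_rows = empirical_distortion(n=200, d=400, k=5, trials=300, rng=rng)
print(few_rows, many_rows)
```

With the same dimension and sparsity, the sampled distortion is noticeably smaller for the larger $n$, in line with the $\exp(-c\delta^2 n)$ tail bound used in the example.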
## Information-Theoretic Sample Complexity
The next question is whether the logarithmic factor is intrinsic. Minimax theory answers by comparing the number of possible supports with the information carried by $n$ linear measurements. The conclusion is that the RIP sample size is also the information-theoretic order for stable uniform recovery.
[definition: Uniform Exact Recovery]
A measurement matrix $A\in\mathbb R^{n\times d}$ allows uniform exact recovery over $\Sigma_k$ if there exists a decoder $\Delta:\mathbb R^n\to\mathbb R^d$ such that
\begin{align*}
\Delta(Ax)=x
\end{align*}
for every $x\in\Sigma_k$.
[/definition]
Uniform exact recovery is a worst-case requirement over all sparse vectors, but by itself it is only an algebraic identifiability notion. A linear map can be injective on a finite or countably chosen family of sparse signals while sending two well-separated sparse vectors to measurements that are arbitrarily close; an exact decoder could still separate them in the noiseless model, but any perturbation of the measurements would destroy that separation. This means that a pure exact-recovery assumption is too weak for the volumetric and minimax arguments used in the course, because those arguments compare separated balls rather than isolated points. The lower bound below adds the missing metric hypothesis: the sensing map must preserve distances on the sparse model up to constant distortion, so the many possible sparse supports have to occupy genuine volume in $\mathbb R^n$.
[quotetheorem:5918]
[citeproof:5918]
The metric hypotheses are essential. Without stability or noise robustness, arbitrary exact nonlinear decoders can exploit algebraic encodings and the counting argument no longer applies. With constant distortion, however, the lower bound explains why convex methods with $k\log(d/k)$ measurements are statistically rate-optimal. The next example isolates the combinatorial source of the logarithm.
[example: Support Identification Cost]
For vectors $x\in\{0,1\}^d$ with exactly $k$ nonzero entries, choosing $x$ is the same as choosing its support
\begin{align*}
S=\{j:x_j=1\}\subset\{1,\dots,d\},\qquad |S|=k.
\end{align*}
Thus the number of alternatives is
\begin{align*}
\binom dk=\frac{d(d-1)\cdots(d-k+1)}{k!}.
\end{align*}
For the upper bound, $d-r\le d$ for $0\le r\le k-1$, and the standard Stirling bound $k!\ge(k/e)^k$ gives
\begin{align*}
\binom dk\le \frac{d^k}{(k/e)^k}=\left(\frac{ed}{k}\right)^k.
\end{align*}
Taking logarithms,
\begin{align*}
\log\binom dk\le k\log(ed/k)=k\log(d/k)+k.
\end{align*}
Assume from now on that $1\le k\le d/2$. Then $d/k\ge2$, so $\log(d/k)\ge\log2$ and
\begin{align*}
k\le \frac{k\log(d/k)}{\log2}.
\end{align*}
Therefore
\begin{align*}
\log\binom dk\le \left(1+\frac1{\log2}\right)k\log(d/k).
\end{align*}
For the lower bound, first suppose $d/k\ge4$. Then $d-r\ge d-k+1\ge d/2$ for $0\le r\le k-1$, while $k!\le k^k$, so
\begin{align*}
\binom dk\ge \frac{(d/2)^k}{k^k}=\left(\frac{d}{2k}\right)^k.
\end{align*}
Taking logarithms gives
\begin{align*}
\log\binom dk\ge k\log(d/k)-k\log2.
\end{align*}
Because $d/k\ge4$, we have $\log(d/k)\ge2\log2$, hence
\begin{align*}
k\log2\le \frac12 k\log(d/k).
\end{align*}
Thus, in this case,
\begin{align*}
\log\binom dk\ge \frac12 k\log(d/k).
\end{align*}
It remains to cover the endpoint range $2\le d/k<4$. Since $k\le d/2$, we have $2k\le d$, so the first $2k$ coordinates are available. Choosing exactly one element from each pair $\{1,2\},\{3,4\},\dots,\{2k-1,2k\}$ gives $2^k$ distinct $k$-element supports. Hence
\begin{align*}
\binom dk\ge 2^k.
\end{align*}
Taking logarithms,
\begin{align*}
\log\binom dk\ge k\log2.
\end{align*}
In the range $d/k<4$,
\begin{align*}
k\log(d/k)\le k\log4=2k\log2,
\end{align*}
so
\begin{align*}
\log\binom dk\ge \frac12 k\log(d/k).
\end{align*}
Combining the upper and lower bounds, there are universal constants $c,C>0$ such that
\begin{align*}
c\,k\log(d/k)\le \log\binom dk\le C\,k\log(d/k),
\qquad 1\le k\le d/2.
\end{align*}
Thus a uniform recovery method must distinguish exponentially many possible supports, and the unknown support contributes the logarithmic factor $\log(d/k)$ to the measurement complexity.
[/example]
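The two-sided bound can be compared against exact binomial coefficients. The sketch below uses the explicit constants obtained in the example, $c=\tfrac12$ and $C=1+1/\log 2$, and a handful of hypothetical $(d,k)$ pairs with $k\le d/2$; the function names are illustrative.

```python
import math

def log_binomial(d, k):
    """Natural logarithm of the number of k-element supports in {1, ..., d}."""
    return math.log(math.comb(d, k))

def entropy_bounds(d, k):
    """Lower and upper bounds (1/2) k log(d/k) and (1 + 1/log 2) k log(d/k)."""
    base = k * math.log(d / k)
    return 0.5 * base, (1.0 + 1.0 / math.log(2.0)) * base

# Illustrative pairs covering the endpoint k = d/2 and very sparse regimes.
for d, k in [(10, 5), (100, 10), (1000, 3), (10**5, 50)]:
    lower, upper = entropy_bounds(d, k)
    value = log_binomial(d, k)
    print(d, k, round(lower, 2), round(value, 2), round(upper, 2))
```

In each row the middle value lies between the outer two, and in the very sparse cases it tracks $k\log(d/k)$ closely, which is the entropy that the measurement count must pay for.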
The noisy experiment replaces exact equality by estimation error.
Noise creates a different obstruction from the noiseless case: even with the correct support, the observations cannot determine the coefficients more accurately than the Gaussian fluctuation level, and the unknown support contributes the same combinatorial entropy as before. The statistical question is therefore the best possible squared-error scale over all estimators, not whether a particular convex program succeeds. The following theorem records that minimax rate.
[quotetheorem:5919]
[citeproof:5919]
The normalization matters: if the observation model were rescaled by an additional factor of $\sqrt n$, the displayed parameter rate would change accordingly. This is an estimation rate, not a support recovery guarantee. Exact support recovery also needs a beta-min condition separating nonzero coordinates from the noise floor.
[example: Gaussian Measurements with Noisy Sparse Signals]
Let $A_{ij}\sim\mathcal N(0,1/n)$ independently and let $x\in\Sigma_k$. Fix a RIP threshold $\delta_*$ for the noisy recovery theorem. The Gaussian sensing matrix net calculation above gives constants $C,c>0$ such that if
\begin{align*}
n\ge C\delta_*^{-2}k\log(ed/k),
\end{align*}
then
\begin{align*}
\mathbb P(\delta_{2k}(A)\le \delta_*)\ge 1-2\exp(-c\delta_*^2 n).
\end{align*}
On this event, in the noisy model
\begin{align*}
y=Ax+\varepsilon,\qquad \varepsilon\sim\mathcal N(0,\sigma^2I_n),
\end{align*}
the noisy sparse recovery rate theorem gives, for a suitable noise-aware $\ell^1$ estimator,
\begin{align*}
|\hat{x}-x|\le C_1\sigma\sqrt{k\log(d/k)}
\end{align*}
with high probability, where $C_1$ depends only on the RIP constants.
This Euclidean guarantee does not by itself force support recovery. Let
\begin{align*}
r=C_1\sigma\sqrt{k\log(d/k)}
\end{align*}
and suppose $j\in\operatorname{supp}(x)$ has $0<|x_j|\le r$. Define
\begin{align*}
z=x-x_j e_j.
\end{align*}
Then $z_j=0$, so $j\notin\operatorname{supp}(z)$, while
\begin{align*}
|z-x|=|-x_j e_j|=|x_j|\,|e_j|=|x_j|\le r.
\end{align*}
Thus a vector with the wrong support can lie inside the same Euclidean error scale allowed by the estimator. Exact support recovery therefore needs an additional beta-min condition, such as $\min_{j\in\operatorname{supp}(x)}|x_j|$ being larger than a constant multiple of $\sigma\sqrt{k\log(d/k)}$.
[/example]
## Random Measurements and Johnson-Lindenstrauss Intuition
The next problem is to understand why Gaussian and subgaussian measurements preserve all sparse vectors using so few rows. A fixed vector only needs one concentration estimate, while uniform recovery requires concentration over an exponentially large structured set. Nets and union bounds convert pointwise concentration into RIP.
[quotetheorem:5920]
[citeproof:5920]
The independence, centering, and subgaussian assumptions supply the pointwise concentration estimate; without a tail assumption of this kind, a few rows can dominate and RIP may fail even when the entries have the correct variance. The theorem is also uniform only over sparse vectors, not over all of $\mathbb R^d$. This is the Johnson-Lindenstrauss proof template applied to the sparse unit sphere. The metric entropy is of order $k\log(d/k)$, so the target dimension has that order.
[remark: Scaling of the Measurement Matrix]
The normalization $A_{ij}\sim\mathcal N(0,1/n)$ gives $\mathbb E[|Ax|^2]=|x|^2$. With unnormalized entries $G_{ij}\sim\mathcal N(0,1)$, the matrix $n^{-1/2}G$ is the natural RIP-scaled sensing matrix.
[/remark]
The probability bound is uniform over sparse vectors after the matrix is drawn. The next example separates this from easier fixed-vector concentration.
[example: Fixed Vector Versus Uniform Sparse Control]
Fix $x\in\mathbb R^d$. If $x=0$, then $Ax=0$, so the concentration statement is trivial. If $x\ne0$, then for each row $i$,
\begin{align*}
(Ax)_i=\sum_{j=1}^d A_{ij}x_j.
\end{align*}
The entries in row $i$ are independent centered Gaussians with variance $1/n$, so the sum is centered Gaussian and its variance is
\begin{align*}
\operatorname{Var}((Ax)_i)=\sum_{j=1}^d x_j^2\operatorname{Var}(A_{ij})=\sum_{j=1}^d x_j^2\frac1n=\frac{|x|^2}{n}.
\end{align*}
The rows are independent, hence $(Ax)_1,\dots,(Ax)_n$ are independent $\mathcal N(0,|x|^2/n)$ variables. Writing $g_1,\dots,g_n$ for independent $\mathcal N(0,1)$ variables gives
\begin{align*}
|Ax|^2=\sum_{i=1}^n(Ax)_i^2\stackrel{d}{=}\frac{|x|^2}{n}\sum_{i=1}^n g_i^2.
\end{align*}
Therefore
\begin{align*}
\frac{|Ax|^2}{|x|^2}\stackrel{d}{=}\frac{\chi_n^2}{n}.
\end{align*}
By chi-square concentration, for $0<\delta<1$,
\begin{align*}
\mathbb P\left(\left||Ax|^2-|x|^2\right|>\delta |x|^2\right)\le 2\exp(-c\delta^2 n).
\end{align*}
Thus a fixed vector only requires $n$ of order $\delta^{-2}$ to make this failure probability bounded by a fixed small constant.
Uniform sparse control asks for the same estimate simultaneously over every unit vector supported on at most $k$ coordinates. For one support $S$ with $|S|=k$, take an $\eta$-net $N_S$ of the unit sphere in $\mathbb R^S$ satisfying
\begin{align*}
|N_S|\le \left(\frac3\eta\right)^k.
\end{align*}
Applying the fixed-vector bound to each $u\in N_S$ and then using the union bound gives
\begin{align*}
\mathbb P\left(\exists u\in N_S:\left||Au|^2-1\right|>\delta/2\right)\le 2\left(\frac3\eta\right)^k\exp(-c\delta^2 n).
\end{align*}
There are at most
\begin{align*}
\binom dk\le \left(\frac{ed}{k}\right)^k
\end{align*}
supports of size $k$. A second union bound therefore gives
\begin{align*}
\mathbb P\left(\exists S,\exists u\in N_S:\left||Au|^2-1\right|>\delta/2\right)\le 2\left(\frac{ed}{k}\right)^k\left(\frac3\eta\right)^k\exp(-c\delta^2 n).
\end{align*}
Equivalently,
\begin{align*}
2\left(\frac{ed}{k}\right)^k\left(\frac3\eta\right)^k\exp(-c\delta^2 n)=2\exp\left(k\log(ed/k)+k\log(3/\eta)-c\delta^2 n\right).
\end{align*}
Choosing $\eta$ to be a sufficiently small constant multiple of $\delta$, the standard net approximation step converts control on each $N_S$ into control on the whole unit sphere in $\mathbb R^S$. The exponent is negative once
\begin{align*}
n\ge C\delta^{-2}k\log(ed/k),
\end{align*}
with $C$ large enough to absorb the net factor. The difference is the entropy cost: one fixed vector costs only $\delta^{-2}$ measurements, while uniform recovery pays for the net inside each support and for the $\binom dk$ possible supports, producing the term $k\log(ed/k)$.
[/example]
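The fixed-vector half of this comparison can be simulated directly, since $|Ax|^2/|x|^2$ has the law $\chi_n^2/n$ regardless of $x\ne0$. The sketch below estimates the failure frequency $\mathbb P(|\chi_n^2/n-1|>\delta)$ for two values of $n$; the seed, trial count, and function names are assumptions made for illustration.

```python
import random

def chi_square_ratio(n, rng):
    """Sample chi^2_n / n, the law of |Ax|^2 / |x|^2 for a fixed x != 0."""
    return sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(n)) / n

def failure_frequency(n, delta, trials, rng):
    """Fraction of trials in which the ratio deviates from 1 by more than delta."""
    bad = sum(
        1 for _ in range(trials)
        if abs(chi_square_ratio(n, rng) - 1.0) > delta
    )
    return bad / trials

rng = random.Random(1)
delta = 0.2
freq_small_n = failure_frequency(n=50, delta=delta, trials=1000, rng=rng)
freq_large_n = failure_frequency(n=800, delta=delta, trials=1000, rng=rng)
print(freq_small_n, freq_large_n)
```

For a single vector a few hundred rows already make failures rare at this $\delta$. Uniform control must multiply that small probability by roughly $\exp(k\log(ed/k)+k\log(3/\eta))$ candidates, which is why the required $n$ grows with the sparse entropy.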
## Stable Recovery and Approximate Sparsity
The next issue is that real signals are rarely exactly sparse and measurements are rarely noiseless. Stable recovery asks whether reconstruction error degrades continuously with noise and with the part of the signal outside its largest coordinates. RIP gives this robustness.
[definition: Best Sparse Approximation Error]
For $x\in\mathbb R^d$ and $1\le k\le d$, define
\begin{align*}
\sigma_k(x)_1:=\inf_{z\in\Sigma_k}\|x-z\|_1.
\end{align*}
[/definition]
This quantity is the $\ell^1$ mass left after retaining the largest $k$ coordinates of $x$. To use it with noisy measurements, the equality constraint in basis pursuit must be widened.
[definition: Basis Pursuit Denoising]
Let $A\in\mathbb R^{n\times d}$, $y\in\mathbb R^n$, and $\eta\ge0$. A basis pursuit denoising solution is any solution of
\begin{align*}
\hat{x}\in\operatorname*{argmin}_{z\in\mathbb R^d}\|z\|_1
\quad\text{subject to}\quad |Az-y|\le\eta.
\end{align*}
[/definition]
The feasible set now permits measurement error, so exact equality is no longer the target conclusion.
There are now two unavoidable ways recovery can fail to be exact. Measurement noise means that many vectors can fit the data within tolerance, while approximate sparsity means that the part of $x$ outside its largest $k$ coordinates is not protected by a $k$-sparse geometric condition. A useful deterministic theorem must keep both effects visible in the error bound, because those are the terms that later become statistical rates.
[quotetheorem:5921]
[citeproof:5921]
Both terms in the bound are necessary. If $\eta>0$ and $x$ is exactly sparse, the reconstruction cannot be more accurate than the noise level allowed by the feasible set; if $\eta=0$ but $x\notin\Sigma_k$, the omitted tail cannot be reconstructed from a theorem designed for $k$-sparse geometry. Stable recovery is the deterministic statement behind noisy compressed sensing rates. The next example shows how approximate sparsity enters through the tail term.
[example: Compressible Power-Law Coefficients]
Let $x\in\mathbb R^d$ have decreasing rearrangement $|x|_{(1)}\ge\cdots\ge |x|_{(d)}$ satisfying $|x|_{(j)}\le Rj^{-a}$ with $a>1$. The best $k$-sparse approximation keeps the $k$ largest coordinates and discards the remaining coordinates, so its $\ell^1$ error is the tail:
\begin{align*}
\sigma_k(x)_1=\sum_{j=k+1}^d |x|_{(j)}.
\end{align*}
Using the power-law bound on each rearranged coordinate gives
\begin{align*}
\sigma_k(x)_1\le R\sum_{j=k+1}^d j^{-a}.
\end{align*}
Since the finite tail is bounded by the infinite tail,
\begin{align*}
R\sum_{j=k+1}^d j^{-a}\le R\sum_{j=k+1}^\infty j^{-a}.
\end{align*}
Because $t\mapsto t^{-a}$ is decreasing on $(0,\infty)$,
\begin{align*}
\sum_{j=k+1}^\infty j^{-a}\le \int_k^\infty t^{-a}\,dt.
\end{align*}
Evaluating the integral,
\begin{align*}
\int_k^\infty t^{-a}\,dt=\frac{k^{1-a}}{a-1}.
\end{align*}
Therefore
\begin{align*}
\sigma_k(x)_1\le \frac{R}{a-1}k^{1-a}.
\end{align*}
The bound from *Stable Recovery under RIP* reads
\begin{align*}
|\hat{x}-x|\le C_0\frac{\sigma_k(x)_1}{\sqrt{k}}+C_1\eta.
\end{align*}
Substituting the tail estimate, the first term is bounded by
\begin{align*}
C_0\frac{\sigma_k(x)_1}{\sqrt{k}}\le \frac{C_0R}{a-1}k^{1/2-a}.
\end{align*}
Thus power-law coefficient decay with exponent $a>1$ turns approximate sparsity into a Euclidean approximation contribution proportional to $Rk^{1/2-a}$, with the separate noise contribution $C_1\eta$ when the measurements are noisy.
[/example]
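The integral comparison can be checked against the worst-case tail sum $R\sum_{j>k}j^{-a}$. The sketch below assumes the extremal profile $|x|_{(j)}=Rj^{-a}$ and hypothetical values of $R$, $a$, and $d$; it omits the theorem's constant $C_0$ when reporting the Euclidean contribution.

```python
def tail_l1(R, a, k, d):
    """Sum of R * j^(-a) over j = k+1, ..., d: the l1 tail for the extremal profile."""
    return sum(R * j ** (-a) for j in range(k + 1, d + 1))

def tail_bound(R, a, k):
    """Integral bound R * k^(1 - a) / (a - 1) on the l1 tail."""
    return R * k ** (1.0 - a) / (a - 1.0)

R, a, d = 1.0, 1.5, 10_000
for k in (10, 100, 1000):
    exact = tail_l1(R, a, k, d)
    bound = tail_bound(R, a, k)
    # Tail contribution to the Euclidean error, without the constant C_0.
    euclid = bound / k ** 0.5
    print(k, exact, bound, euclid)
```

With $a=1.5$, each tenfold increase in $k$ reduces the Euclidean contribution by a factor of $10$, matching the exponent $k^{1/2-a}=k^{-1}$.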
## Phase Transitions and Donoho-Tanner Geometry
The final question is why compressed sensing displays sharp empirical thresholds when $n,d,k$ grow proportionally. Worst-case sample complexity gives the order $k\log(d/k)$ in very sparse regimes, but proportional asymptotics reveal a finer phase diagram. Donoho-Tanner theory studies this diagram through random projections of polytopes.
[definition: Proportional Growth Regime]
A sequence of compressed sensing problems is in the proportional growth regime if $d\to\infty$ and
\begin{align*}
\frac nd\to\delta\in(0,1),\qquad \frac kn\to\rho\in(0,1).
\end{align*}
[/definition]
The parameters $\delta$ and $\rho$ describe undersampling and sparsity relative to the number of measurements.
In this regime, order estimates such as $k\log(d/k)$ no longer describe the observed boundary sharply, because both the number of measurements and the sparsity grow linearly with the ambient dimension. The obstruction is geometric: random projection may or may not preserve the faces of the cross-polytope corresponding to sparse signed supports.
The theorem below concerns a random support and random sign model, not worst-case uniform recovery over every sparse vector. The signal support is sampled among $k$-subsets, the nonzero signs are sampled independently, and recovery means that basis pursuit returns that sampled sparse vector exactly from noiseless measurements. It is stated here because it gives the precise asymptotic boundary behind the empirical phase diagrams for $\ell^1$ recovery, but it should be read as a landmark from conic integral geometry rather than as another minimax lower bound.
[quotetheorem:5922]
This result belongs to conic integral geometry and the neighborliness theory of randomly projected cross-polytopes, rather than to the minimax and concentration toolkit developed in the preceding sections.
[explanation: Geometry Behind the Phase Transition]
Basis pursuit succeeds when the affine space $\{z:Az=y\}$ touches the $\ell^1$ ball at the true sparse vector and nowhere else. Geometrically, this asks whether a face of the $\ell^1$ ball survives under projection by $A$. The $\ell^1$ ball in $\mathbb R^d$ is the cross-polytope, whose low-dimensional faces correspond to sparse signed supports. Donoho-Tanner theory computes when random projections preserve these faces, yielding a threshold curve rather than a single order bound.
[/explanation]
The same geometry distinguishes uniqueness from stability. A measurement matrix may be injective on $k$-sparse vectors without giving a well-conditioned inverse, while stable recovery requires quantitative separation from the null space.
[remark: Uniqueness Versus Stability]
The condition $\delta_{2k}(A)<1$ guarantees uniqueness of $k$-sparse solutions in the noiseless problem. Basis pursuit recovery and stable recovery require stronger quantitative conditions, such as RIP with a smaller constant or a robust null-space property. This distinction mirrors the difference between identifiability and estimability in statistical experiments.
[/remark]
Compressed sensing therefore fits the course's minimax narrative. Random matrix concentration provides achievability, packing and Fano arguments provide lower bounds, and [convex geometry](/page/Convex%20Geometry) explains the sharper phase transitions seen by practical algorithms.
Compressed sensing shows that sparse recovery is governed by the same testing and packing principles developed earlier, but now through linear observation schemes. To analyze those schemes rigorously, we need the random matrix tools that control measurement operators, covariance fluctuations, and conditioning.
# 6. Random Matrix Preliminaries
This chapter prepares the random matrix tools used later for minimax lower bounds, covariance estimation, spiked models, and compressed sensing. Chapters 1 through 5 treated high-dimensional statistical difficulty through testing, information inequalities, and sparse linear models; here the emphasis shifts to the random linear operators that appear in those reductions and estimators. The recurring questions are spectral: how are eigenvalues distributed, how large are operator norms, and when is a random design well-conditioned on the subspaces relevant to estimation?
## Empirical Spectral Distributions and Resolvents
The first problem is how to describe the spectrum of a large matrix without tracking every eigenvalue individually. In high-dimensional statistics, a covariance estimator or Gram matrix may have thousands of eigenvalues, and its limiting behaviour is often visible only after averaging them into a probability measure.
[definition: Empirical Spectral Distribution]
Let $A_n \in \mathbb R^{n \times n}$ be a symmetric matrix with eigenvalues $\lambda_1(A_n), \dots, \lambda_n(A_n)$, counted with algebraic multiplicity. The empirical spectral distribution of $A_n$ is the probability measure
\begin{align*}
\mu_{A_n} := \frac{1}{n}\sum_{i=1}^n \delta_{\lambda_i(A_n)}.
\end{align*}
[/definition]
The empirical spectral distribution turns a matrix problem into a problem about [weak convergence](/page/Weak%20Convergence) of probability measures. To compare spectra across dimensions, we need a convergence notion that tests stable averages rather than individual eigenvalue labels.
[definition: Weak Convergence of Probability Measures]
Let $(\mu_n)_{n\ge 1}$ and $\mu$ be probability measures on $\mathbb R$. We say that $\mu_n$ converges weakly to $\mu$, written $\mu_n \xrightarrow{d} \mu$, if for every bounded [continuous function](/page/Continuous%20Function) $f:\mathbb R\to\mathbb R$,
\begin{align*}
\int_{\mathbb R} f(x)\,d\mu_n(x) \to \int_{\mathbb R} f(x)\,d\mu(x).
\end{align*}
[/definition]
Weak convergence is the right topology for bulk spectral limits, but direct testing against all bounded continuous functions is rarely convenient.
For spectra, the practical obstruction is that eigenvalues move with dimension and individual labels are unstable. What can be estimated robustly are smoothed averages of the empirical measure. The transform below encodes those smoothed averages in a single complex function and is designed to interact well with matrix inverse identities.
[definition: Stieltjes Transform]
Let $\mu$ be a probability measure on $\mathbb R$. Its Stieltjes transform is the function $m_\mu:\mathbb C\setminus \mathbb R\to\mathbb C$ defined by
\begin{align*}
m_\mu(z) := \int_{\mathbb R} \frac{1}{x-z}\,d\mu(x).
\end{align*}
[/definition]
The imaginary part of $m_\mu(z)$ contains smoothed information about the mass of $\mu$ near $\operatorname{Re}(z)$.
For an empirical spectral distribution, however, the transform should be computed from the matrix without diagonalising it every time. The obstruction is that eigenvalue formulas are not well suited to perturbation and concentration arguments, while inverses of shifted matrices satisfy useful algebraic identities. This leads to the matrix-valued object whose trace reproduces the transform.
[definition: Resolvent]
Let $A\in\mathbb R^{n\times n}$ be symmetric. The resolvent of $A$ is the map
\begin{align*}
R_A:\mathbb C\setminus\mathbb R&\to \mathbb C^{n\times n}, & z&\mapsto (A-zI)^{-1}.
\end{align*}
[/definition]
Resolvents are useful because they retain the full spectral information while being more amenable to algebraic identities and concentration arguments.
The remaining point is to connect the two languages exactly. The Stieltjes transform is defined from the empirical measure, while the resolvent is defined from the matrix; without an identity between them, estimates on matrix inverses would not automatically say anything about spectral distributions. The following theorem supplies that finite-dimensional bridge.
[quotetheorem:5923]
[citeproof:5923]
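As a quick numerical sanity check, and assuming the trace identity takes the standard form $m_{\mu_A}(z)=\frac1n\operatorname{tr}R_A(z)$, the following sketch compares both sides for a $2\times 2$ symmetric matrix, where the eigenvalues and the inverse are available in closed form. The function names and the particular matrix are illustrative choices, not notation from the course.

```python
import math

def eigenvalues_sym2(a, b, c):
    """Eigenvalues of the symmetric matrix [[a, b], [b, c]] in closed form."""
    mean = (a + c) / 2.0
    radius = math.hypot((a - c) / 2.0, b)
    return mean - radius, mean + radius

def stieltjes_from_spectrum(eigs, z):
    """Stieltjes transform of the empirical spectral distribution at z."""
    return sum(1.0 / (lam - z) for lam in eigs) / len(eigs)

def normalised_resolvent_trace_sym2(a, b, c, z):
    """(1/2) tr (A - zI)^{-1} for A = [[a, b], [b, c]]."""
    det = (a - z) * (c - z) - b * b
    # The inverse of [[p, q], [q, r]] is (1/det) [[r, -q], [-q, p]],
    # so its trace is (p + r) / det.
    return ((a - z) + (c - z)) / det / 2.0

a, b, c = 2.0, -1.0, 0.5      # illustrative symmetric matrix
z = 0.3 + 0.7j                # any point off the real axis
lhs = stieltjes_from_spectrum(eigenvalues_sym2(a, b, c), z)
rhs = normalised_resolvent_trace_sym2(a, b, c, z)
print(abs(lhs - rhs))         # difference at floating-point level
```

The spectral side needs the eigenvalues, while the resolvent side only needs an inverse; this is the practical reason the resolvent is preferred in perturbation arguments.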
The trace formula explains why random matrix arguments often estimate diagonal entries or traces of resolvents. The symmetry hypothesis is essential here because it gives a real spectral decomposition and a probability measure on real eigenvalues; for non-normal matrices, eigenvalues alone do not control resolvents in the same stable way. The restriction $z\in\mathbb C\setminus\mathbb R$ keeps $z$ at distance at least $|\operatorname{Im} z|$ from every eigenvalue, so the inverse exists and satisfies $\|R_A(z)\|_{\mathrm{op}}\le 1/|\operatorname{Im} z|$. The identity is exact at finite $n$, but by itself it is only a change of representation; to make it a convergence criterion, we need to know that convergence of the transforms forces convergence of the underlying measures.
[quotetheorem:5924]
[citeproof:5924]
This criterion reduces a measure convergence problem to pointwise convergence of analytic quantities. Probability measures on $\mathbb R$ are the natural setting because empirical spectral distributions of symmetric matrices live on the real line and have total mass one. Knowing the transform on $\mathbb C\setminus\mathbb R$ is enough because its imaginary part is, up to a factor $\pi$, the Poisson smoothing of the measure, and the boundary behaviour recovers interval masses at continuity points. The theorem does not say that an arbitrary pointwise limit of Stieltjes transforms is automatically the transform of a probability measure, because total mass one for each $\mu_n$ does not by itself give tightness: for $\mu_n=\delta_n$ the transforms $m_{\mu_n}(z)=1/(n-z)$ converge pointwise to the constant function $0$, which corresponds to zero mass rather than to a probability measure, and $\mu_n$ has no weak limit. Boundary information is also not optional. If two point masses $\delta_0$ and $\delta_\eta$ are observed only through values of their transforms at a fixed height $\operatorname{Im} z=1$, their Poisson-smoothed densities are close when $\eta$ is small, but their interval masses on $(-\eta/2,\eta/2)$ differ. The inversion step is what prevents such smoothed agreement from being mistaken for weak convergence. The deterministic example below shows how empirical spectra can converge before we encounter genuinely random ensembles.
[example: Empirical Spectrum of a Diagonal Matrix]
Let $A_n=\operatorname{diag}(1/n,2/n,\dots,n/n)$. Since $A_n$ is diagonal, its eigenvalues are the diagonal entries $1/n,2/n,\dots,n/n$, counted with multiplicity one, so its empirical spectral distribution is
\begin{align*}
\mu_{A_n}=\frac{1}{n}\sum_{i=1}^n\delta_{i/n}.
\end{align*}
We show that $\mu_{A_n}$ converges weakly to the uniform distribution on $[0,1]$. Let $g:\mathbb R\to\mathbb R$ be bounded and continuous. Then
\begin{align*}
\int_{\mathbb R}g(x)\,d\mu_{A_n}(x)=\frac{1}{n}\sum_{i=1}^n g(i/n).
\end{align*}
The uniform distribution on $[0,1]$ has density $1$ on $[0,1]$, hence
\begin{align*}
\int_{\mathbb R}g(x)\,d\operatorname{Unif}[0,1](x)=\int_0^1 g(x)\,dx.
\end{align*}
The restriction of $g$ to $[0,1]$ is continuous on a compact interval, so it is Riemann integrable. Therefore the right-endpoint Riemann sums converge:
\begin{align*}
\frac{1}{n}\sum_{i=1}^n g(i/n)=\sum_{i=1}^n g(i/n)\frac{1}{n}\longrightarrow \int_0^1 g(x)\,dx.
\end{align*}
By the definition of weak convergence, $\mu_{A_n}\xrightarrow{d}\operatorname{Unif}[0,1]$. In this example, the individual eigenvalues form an increasingly fine grid, while the empirical spectral distribution records their limiting bulk average.
[/example]
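The Riemann-sum argument in the example can be checked numerically. The sketch below integrates the bounded continuous test function $g=\cos$ against the empirical spectral distribution of $\operatorname{diag}(1/n,\dots,n/n)$ and compares the result with $\int_0^1\cos(x)\,dx=\sin(1)$. The choice of $g$ and the helper name `esd_integral_diag` are assumptions made only for illustration.

```python
import math

def esd_integral_diag(n, g):
    """Integral of g against the ESD of diag(1/n, 2/n, ..., n/n)."""
    return sum(g(i / n) for i in range(1, n + 1)) / n

limit = math.sin(1.0)  # integral of cos over [0, 1]
errors = {
    n: abs(esd_integral_diag(n, math.cos) - limit)
    for n in (10, 100, 1000)
}
for n, err in errors.items():
    print(n, err)  # the error shrinks as the grid of eigenvalues refines
```

A single test function does not prove weak convergence, but it shows the mechanism: the integral against $\mu_{A_n}$ is exactly a right-endpoint Riemann sum.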
## Gaussian and Sub-Gaussian Matrix Ensembles
The next problem is to identify the random matrices that arise from data. A design matrix produces a sample covariance matrix, a symmetric noise model produces a Wigner matrix, and non-Gaussian data often require sub-Gaussian assumptions that retain Gaussian-type concentration.
[definition: Sub-Gaussian Random Variable]
A real-valued random variable $X$ is sub-Gaussian with parameter $K>0$ if
\begin{align*}
\mathbb E[e^{t(X-\mathbb E[X])}] \le \exp\left(\frac{K^2t^2}{2}\right)
\end{align*}
for every $t\in\mathbb R$.
[/definition]
Sub-Gaussianity is the tail condition that allows Gaussian proofs to survive under weaker assumptions.
For random matrix rows, a scalar tail bound on each coordinate is not enough: a bad linear combination could still have heavy tails or the covariance could distort Euclidean lengths before any sampling error appears. Since covariance estimation compares empirical quadratic forms with Euclidean lengths, the row model must also remove any deterministic anisotropy in the population covariance. The following definition builds both requirements into one hypothesis: all linear projections have sub-Gaussian tails, and the population covariance is already normalised to the identity.
[definition: Isotropic Sub-Gaussian Random Vector]
A random vector $X\in\mathbb R^d$ is isotropic sub-Gaussian with parameter $K>0$ if $\mathbb E[X]=0$, $\mathbb E[XX^\top]=I_d$, and $u\cdot X$ is sub-Gaussian with parameter $K|u|$ for every $u\in\mathbb R^d$.
[/definition]
Isotropy normalises the population covariance so that the target geometry is Euclidean. The next object measures how well the empirical geometry obtained from independent rows approximates that population geometry.
[definition: Sample Covariance Matrix]
Let $n,d\in\mathbb N$. The sample covariance transformation is the map
\begin{align*}
\widehat{\Sigma}_{n,d}:(\mathbb R^d)^n&\to \mathbb R^{d\times d}_{\mathrm{sym}}, &
(X_1,\dots,X_n)&\mapsto \frac{1}{n}\sum_{i=1}^n X_iX_i^\top,
\end{align*}
where $\mathbb R^{d\times d}_{\mathrm{sym}}$ denotes the symmetric $d\times d$ matrices. The data-matrix form of the sample covariance transformation is
\begin{align*}
\widehat{\Sigma}^{\mathrm{mat}}_{n,d}:\mathbb R^{n\times d}&\to \mathbb R^{d\times d}_{\mathrm{sym}}, &
X&\mapsto \frac{1}{n}X^\top X.
\end{align*}
[/definition]
Sample covariance matrices are the main random matrices in regression and covariance estimation. To see the fixed-direction behaviour before taking a spectral supremum, it helps to inspect the Gaussian case.
[example: Gaussian Sample Covariance]
Let $X_1,\dots,X_n\overset{\text{i.i.d.}}{\sim}\mathcal N(0,I_d)$ and set
\begin{align*}
\hat{\Sigma}=\frac{1}{n}\sum_{i=1}^n X_iX_i^\top.
\end{align*}
For any fixed $u\in\mathbb R^d$ with $|u|=1$, we compute the quadratic form in direction $u$. By linearity of matrix multiplication over sums,
\begin{align*}
u^\top\hat{\Sigma}u=u^\top\left(\frac{1}{n}\sum_{i=1}^n X_iX_i^\top\right)u=\frac{1}{n}\sum_{i=1}^n u^\top X_iX_i^\top u.
\end{align*}
For each $i$, the summand is a product of two scalars:
\begin{align*}
u^\top X_iX_i^\top u=(u^\top X_i)(X_i^\top u).
\end{align*}
Since $u^\top X_i=X_i^\top u=u\cdot X_i$, this gives
\begin{align*}
u^\top\hat{\Sigma}u=\frac{1}{n}\sum_{i=1}^n (u\cdot X_i)^2.
\end{align*}
Because $X_i\sim\mathcal N(0,I_d)$, the scalar projection $u\cdot X_i$ is Gaussian with mean
\begin{align*}
\mathbb E[u\cdot X_i]=u\cdot \mathbb E[X_i]=0
\end{align*}
and variance
\begin{align*}
\operatorname{Var}(u\cdot X_i)=u^\top I_d u=u^\top u=|u|^2=1.
\end{align*}
Therefore $u\cdot X_i\sim\mathcal N(0,1)$, so $(u\cdot X_i)^2\sim\chi^2_1$. Independence of $X_1,\dots,X_n$ implies independence of the scalar projections $u\cdot X_1,\dots,u\cdot X_n$, and hence of their squares. Thus $u^\top\hat{\Sigma}u$ is the average of $n$ independent $\chi^2_1$ variables.
For $Z\sim\mathcal N(0,1)$, the standard Gaussian moments give $\mathbb E[Z^2]=1$ and $\mathbb E[Z^4]=3$, so
\begin{align*}
\operatorname{Var}(Z^2)=\mathbb E[Z^4]-(\mathbb E[Z^2])^2=3-1=2.
\end{align*}
Consequently,
\begin{align*}
\mathbb E[u^\top\hat{\Sigma}u]=\frac{1}{n}\sum_{i=1}^n \mathbb E[(u\cdot X_i)^2]=\frac{1}{n}\sum_{i=1}^n 1=1.
\end{align*}
Using independence of the summands,
\begin{align*}
\operatorname{Var}(u^\top\hat{\Sigma}u)=\operatorname{Var}\left(\frac{1}{n}\sum_{i=1}^n (u\cdot X_i)^2\right)=\frac{1}{n^2}\sum_{i=1}^n 2=\frac{2}{n}.
\end{align*}
So for each fixed unit direction $u$, the quadratic form $u^\top\hat{\Sigma}u$ is centered at $1$ with variance $2/n$. The spectral question is harder because it asks for this fixed-direction control uniformly over all $u\in S^{d-1}$.
[/example]
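The mean and variance computed in the example can be compared with a small Monte Carlo experiment. This is a minimal sketch using only the standard library; the dimension, sample size, number of trials, seed, and the helper name `quadratic_form_sample` are illustrative assumptions.

```python
import math
import random

def quadratic_form_sample(n, u, rng):
    """Draw X_1..X_n ~ N(0, I_d) and return u^T Sigma_hat u = (1/n) sum (u . X_i)^2."""
    d = len(u)
    total = 0.0
    for _ in range(n):
        x = [rng.gauss(0.0, 1.0) for _ in range(d)]
        total += sum(ui * xi for ui, xi in zip(u, x)) ** 2
    return total / n

rng = random.Random(0)          # fixed seed for reproducibility
n, d, trials = 50, 4, 2000
u = [1.0 / math.sqrt(d)] * d    # a fixed unit vector
samples = [quadratic_form_sample(n, u, rng) for _ in range(trials)]
mean = sum(samples) / trials
var = sum((s - mean) ** 2 for s in samples) / (trials - 1)
print(mean, var, 2.0 / n)       # empirical mean, empirical variance, theoretical variance
```

The experiment fixes one direction $u$; it says nothing about the supremum over the sphere, which is the harder spectral question.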
The sample covariance ensemble starts from a rectangular data matrix $X\in\mathbb R^{n\times d}$; symmetry appears only after forming $X^\top X$.
A different spectral model is needed when the randomness is already symmetric and represents pairwise noise rather than observations of feature vectors. In that setting, independence can only be imposed on the upper-triangular entries, and the variance must be scaled so that the eigenvalues remain of constant size as $n$ grows. The following definition isolates this canonical symmetric ensemble.
[definition: Wigner Matrix]
A symmetric random matrix $W\in\mathbb R^{n\times n}$ is a Wigner matrix if the entries $(W_{ij})_{1\le i<j\le n}$ are independent, centred, identically distributed random variables, the diagonal entries are centred and independent of the off-diagonal entries, there is a constant $C>0$ independent of $n$ such that $\mathbb E[W_{ii}^2]\le C/n$ for every $i$, and the matrix is scaled so that $\mathbb E[W_{ij}^2]=1/n$ for $i\ne j$.
[/definition]
The $1/n$ variance scaling keeps the bulk of the spectrum in a bounded interval as $n$ grows. We also need a name for the Gaussian members of these rectangular and symmetric families, because they provide the sharp benchmark inequalities used throughout the course.
[definition: Gaussian Matrix Ensemble]
A Gaussian matrix ensemble is a random matrix whose specified independent entries are Gaussian random variables. In the rectangular design case, $G\in\mathbb R^{n\times d}$ has independent entries $G_{ij}\sim\mathcal N(0,1)$. In the symmetric Wigner case, the independent upper-triangular entries are Gaussian with the variance scaling prescribed by the ensemble.
[/definition]
Gaussian ensembles are exactly solvable enough to provide benchmarks, while sub-Gaussian ensembles give the robustness needed in statistics. The next example records the scalar building blocks behind both classes.
[example: Rademacher and Gaussian Variables]
Let $\varepsilon$ take values $-1$ and $1$ with probability $1/2$ each. Its mean is
\begin{align*}
\mathbb E[\varepsilon]=(-1)\cdot\frac12+1\cdot\frac12=0.
\end{align*}
For every $t\in\mathbb R$,
\begin{align*}
\mathbb E[e^{t\varepsilon}]=\frac12 e^{-t}+\frac12 e^{t}=\frac{e^t+e^{-t}}{2}=\cosh(t).
\end{align*}
Using the [power series](/page/Power%20Series) for $\cosh$,
\begin{align*}
\cosh(t)=\sum_{k=0}^\infty \frac{t^{2k}}{(2k)!}.
\end{align*}
For each $k\ge 0$,
\begin{align*}
(2k)! = \prod_{j=1}^k (2j-1)(2j)\ge \prod_{j=1}^k 2j=2^k k!.
\end{align*}
Therefore
\begin{align*}
\cosh(t)\le \sum_{k=0}^\infty \frac{t^{2k}}{2^k k!}.
\end{align*}
The right-hand side is
\begin{align*}
\sum_{k=0}^\infty \frac{t^{2k}}{2^k k!}=\sum_{k=0}^\infty \frac{(t^2/2)^k}{k!}=e^{t^2/2}.
\end{align*}
Hence
\begin{align*}
\mathbb E[e^{t(\varepsilon-\mathbb E[\varepsilon])}]=\mathbb E[e^{t\varepsilon}]\le e^{t^2/2}.
\end{align*}
By the definition of a sub-Gaussian random variable, $\varepsilon$ is sub-Gaussian with parameter $1$.
Now let $Z\sim\mathcal N(0,\sigma^2)$ with $\sigma>0$. Then $\mathbb E[Z]=0$, and its density is $(\sqrt{2\pi}\sigma)^{-1}\exp(-z^2/(2\sigma^2))$. Thus
\begin{align*}
\mathbb E[e^{tZ}]=\frac{1}{\sqrt{2\pi}\sigma}\int_{\mathbb R}\exp\left(tz-\frac{z^2}{2\sigma^2}\right)\,dz.
\end{align*}
Complete the square in the exponent:
\begin{align*}
tz-\frac{z^2}{2\sigma^2}=-\frac{z^2-2\sigma^2tz}{2\sigma^2}.
\end{align*}
Since
\begin{align*}
z^2-2\sigma^2tz=(z-\sigma^2t)^2-\sigma^4t^2,
\end{align*}
we get
\begin{align*}
tz-\frac{z^2}{2\sigma^2}=-\frac{(z-\sigma^2t)^2}{2\sigma^2}+\frac{\sigma^2t^2}{2}.
\end{align*}
Substituting this identity into the integral gives
\begin{align*}
\mathbb E[e^{tZ}]=e^{\sigma^2t^2/2}\frac{1}{\sqrt{2\pi}\sigma}\int_{\mathbb R}\exp\left(-\frac{(z-\sigma^2t)^2}{2\sigma^2}\right)\,dz.
\end{align*}
The remaining integral is the total mass of a $\mathcal N(\sigma^2t,\sigma^2)$ density, so
\begin{align*}
\frac{1}{\sqrt{2\pi}\sigma}\int_{\mathbb R}\exp\left(-\frac{(z-\sigma^2t)^2}{2\sigma^2}\right)\,dz=1.
\end{align*}
Therefore
\begin{align*}
\mathbb E[e^{t(Z-\mathbb E[Z])}]=\mathbb E[e^{tZ}]=e^{\sigma^2t^2/2}.
\end{align*}
By the definition of a sub-Gaussian random variable, $Z$ is sub-Gaussian with parameter $\sigma$. These computations show that the sub-Gaussian class contains both bounded discrete signs and continuous Gaussian noise.
[/example]
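Both moment generating function computations can be checked numerically. The sketch below verifies $\cosh(t)\le e^{t^2/2}$ on a grid and approximates the Gaussian moment generating function by a midpoint rule. The grid, the quadrature parameters, and the values of $\sigma$ and $t$ are illustrative assumptions.

```python
import math

def gaussian_mgf_quadrature(t, sigma, half_width=12.0, steps=20000):
    """Midpoint-rule approximation of E[exp(tZ)] for Z ~ N(0, sigma^2)."""
    # Centre the grid at sigma^2 * t, where the integrand peaks after
    # completing the square.
    lo = sigma * sigma * t - half_width * sigma
    h = 2.0 * half_width * sigma / steps
    total = 0.0
    for k in range(steps):
        z = lo + (k + 0.5) * h
        total += math.exp(t * z - z * z / (2.0 * sigma * sigma))
    return total * h / (math.sqrt(2.0 * math.pi) * sigma)

# Rademacher: the moment generating function is cosh(t).
grid = [k / 10.0 for k in range(-50, 51)]
rademacher_ok = all(math.cosh(t) <= math.exp(t * t / 2.0) for t in grid)

# Gaussian: compare quadrature with the closed form exp(sigma^2 t^2 / 2).
sigma, t = 1.5, 0.8
approx = gaussian_mgf_quadrature(t, sigma)
exact = math.exp(sigma * sigma * t * t / 2.0)
print(rademacher_ok, approx, exact)
```

The Rademacher bound is an inequality with equality only at $t=0$, while the Gaussian case is an identity; the Gaussian is the extremal member of its sub-Gaussian class.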
## Singular Values, Operator Norms, and Conditioning
The central finite-sample question is how much a random matrix can stretch or shrink a vector. For estimation, shrinkage is as important as expansion: a design matrix with a very small minimum singular value makes inverse problems unstable even if its largest singular value is controlled.
[definition: Singular Values and Operator Norm]
Let $n,d\in\mathbb N$ and set $r=\min(n,d)$. For each $k\in\{1,\dots,r\}$, the $k$th singular value is the map
\begin{align*}
s_k:\mathbb R^{n\times d}&\to [0,\infty), & A&\mapsto \lambda_k\bigl((A^\top A)^{1/2}\bigr),
\end{align*}
where the eigenvalues of $(A^\top A)^{1/2}$ are listed in non-increasing order:
\begin{align*}
s_1(A)\ge s_2(A)\ge \dots \ge s_r(A)\ge 0.
\end{align*}
The operator norm is the map
\begin{align*}
\|\cdot\|_{\mathrm{op}}:\mathbb R^{n\times d}&\to [0,\infty), &
A&\mapsto \sup_{u\in\mathbb R^d,\ |u|=1}|Au| = s_1(A).
\end{align*}
[/definition]
The largest singular value measures worst-case amplification, while the smallest singular value measures stability of inversion on the image. A tall design can therefore be dangerous in two opposite ways: it may magnify noise in some direction, or it may nearly collapse another direction so that least-squares inversion becomes unstable.
For least-squares stability, the obstruction is the coexistence of these two effects. Multiplying all entries of $A$ by the same nonzero scalar changes both extremes but not the intrinsic ill-conditioning of the inverse problem, so the relevant quantity must compare the extremes rather than record their absolute size.
[definition: Condition Number]
Let $n,d\in\mathbb N$ with $n\ge d$, and let
\begin{align*}
\mathcal F_{n,d}:=\{A\in\mathbb R^{n\times d}: \operatorname{rank}(A)=d\}.
\end{align*}
The condition number is the map
\begin{align*}
\kappa:\mathcal F_{n,d}&\to [1,\infty), &
A&\mapsto \frac{s_1(A)}{s_d(A)}.
\end{align*}
[/definition]
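For a matrix with two columns, the singular values can be computed by hand from the $2\times 2$ Gram matrix $A^\top A$, which makes the definitions easy to test. The following sketch is restricted to that case; the helper names and the example matrix are assumptions made for illustration. It also checks that rescaling $A$ changes the singular values but not the condition number.

```python
import math

def singular_values_two_columns(A):
    """Singular values s_1 >= s_2 of an n x 2 matrix via the eigenvalues of A^T A."""
    g11 = sum(row[0] * row[0] for row in A)
    g12 = sum(row[0] * row[1] for row in A)
    g22 = sum(row[1] * row[1] for row in A)
    mean = (g11 + g22) / 2.0
    radius = math.hypot((g11 - g22) / 2.0, g12)
    # Clamp at zero to guard against tiny negative values from rounding.
    return math.sqrt(mean + radius), math.sqrt(max(mean - radius, 0.0))

def condition_number(A):
    """s_1 / s_2, defined only when A has full column rank."""
    s1, s2 = singular_values_two_columns(A)
    if s2 == 0.0:
        raise ValueError("matrix does not have full column rank")
    return s1 / s2

A = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]           # illustrative 3 x 2 matrix
scaled = [[10.0 * entry for entry in row] for row in A]
print(singular_values_two_columns(A), condition_number(A))
print(singular_values_two_columns(scaled), condition_number(scaled))
```

Here $A^\top A$ has eigenvalues $3$ and $1$, so $s_1=\sqrt3$, $s_2=1$, and the condition number is $\sqrt3$ before and after scaling.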
A condition number close to $1$ means that $A$ acts almost as a scaled isometry on $\mathbb R^d$. For random designs the obstruction is that both extremes of the singular spectrum can fluctuate: one column direction may be amplified while another is nearly lost. To prove that Gaussian least-squares problems are well-conditioned, one needs simultaneous non-asymptotic bounds on the largest and smallest singular values at the correct $\sqrt n\pm\sqrt d$ scale.
[quotetheorem:5925]
[citeproof:5925]
These bounds give non-asymptotic control with the correct first-order scale. Gaussianity enters in two separate ways: it gives concentration for Lipschitz functions of the entries, and it allows Gordon comparison to locate the expectations at the $\sqrt n\pm\sqrt d$ scale. Independence and the $\mathcal N(0,1)$ normalisation fix the covariance of each row and determine the singular-value scale; changing the variance rescales the conclusion. The assumptions cannot be dropped without changing the statement. If every entry is multiplied by $\sigma$, then $s_1$ and $s_d$ are multiplied by $\sigma$, so the displayed thresholds are wrong unless $\sigma=1$. If all columns of $G$ are forced to be identical, independence across entries fails and the rank is at most $1$, so $s_d(G)=0$ for $d\ge 2$ even when $n>d$. If entries have heavy tails, a single unusually large entry can make $\|G\|_{\mathrm{op}}$ much larger than $\sqrt n+\sqrt d+t$ with probability not controlled by $e^{-t^2/2}$. The theorem is not an empirical spectral distribution result, since it controls only the extreme singular values, and it does not directly cover sub-Gaussian ensembles without additional comparison or net arguments. It does, however, translate directly into statements about the empirical covariance matrix $n^{-1}G^\top G$, whose eigenvalues are $s_i(G)^2/n$.
[example: Largest Singular Value of a Gaussian Matrix]
Let $G\in\mathbb R^{n\times d}$ have independent $\mathcal N(0,1)$ entries, and fix $0<\delta\le 1$. By the definition of the operator norm through singular values,
\begin{align*}
\|G\|_{\mathrm{op}}=s_1(G).
\end{align*}
Set
\begin{align*}
t=\sqrt{2\log(1/\delta)}.
\end{align*}
Since $0<\delta\le 1$, we have $\log(1/\delta)\ge 0$, so $t\ge 0$ and the upper-tail part of *[Davidson-Szarek Gaussian Singular Value Bounds](/theorems/5925)* applies. It gives
\begin{align*}
\mathbb P\left(s_1(G)\ge \sqrt n+\sqrt d+\sqrt{2\log(1/\delta)}\right)\le \exp\left(-\frac{(\sqrt{2\log(1/\delta)})^2}{2}\right).
\end{align*}
The exponent is
\begin{align*}
-\frac{(\sqrt{2\log(1/\delta)})^2}{2}=-\frac{2\log(1/\delta)}{2}=-\log(1/\delta).
\end{align*}
Because $-\log(1/\delta)=\log(\delta)$, the right-hand side is
\begin{align*}
\exp(-\log(1/\delta))=\exp(\log\delta)=\delta.
\end{align*}
Therefore
\begin{align*}
\mathbb P\left(s_1(G)\ge \sqrt n+\sqrt d+\sqrt{2\log(1/\delta)}\right)\le \delta.
\end{align*}
Taking complements gives
\begin{align*}
\mathbb P\left(s_1(G)< \sqrt n+\sqrt d+\sqrt{2\log(1/\delta)}\right)\ge 1-\delta.
\end{align*}
The event with strict inequality is contained in the event with non-strict inequality, so using $\|G\|_{\mathrm{op}}=s_1(G)$,
\begin{align*}
\mathbb P\left(\|G\|_{\mathrm{op}}\le \sqrt n+\sqrt d+\sqrt{2\log(1/\delta)}\right)\ge 1-\delta.
\end{align*}
Thus the high-probability operator-norm scale is $\sqrt n+\sqrt d$ up to the confidence correction $\sqrt{2\log(1/\delta)}$, rather than the Frobenius scale $\sqrt{nd}$; prediction-error bounds based on operator norms are therefore governed by spectral size, not by the total entrywise energy.
[/example]
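The confidence-level calculation in the example can be sketched in code. The block below computes the threshold $\sqrt n+\sqrt d+\sqrt{2\log(1/\delta)}$ and compares it with simulated values of $s_1(G)$. To stay within the standard library it takes $d=2$, where $s_1$ has a closed form through the $2\times2$ Gram matrix; the sizes, seed, and helper names are illustrative assumptions.

```python
import math
import random

def op_norm_threshold(n, d, delta):
    """The level sqrt(n) + sqrt(d) + sqrt(2 log(1/delta)) from the example."""
    return math.sqrt(n) + math.sqrt(d) + math.sqrt(2.0 * math.log(1.0 / delta))

def largest_singular_value_two_columns(G):
    """s_1 of an n x 2 matrix from the top eigenvalue of its Gram matrix."""
    g11 = sum(row[0] * row[0] for row in G)
    g12 = sum(row[0] * row[1] for row in G)
    g22 = sum(row[1] * row[1] for row in G)
    return math.sqrt((g11 + g22) / 2.0 + math.hypot((g11 - g22) / 2.0, g12))

rng = random.Random(1)
n, d, delta, trials = 100, 2, 0.05, 300
t = math.sqrt(2.0 * math.log(1.0 / delta))
tail = math.exp(-t * t / 2.0)          # equals delta by construction
threshold = op_norm_threshold(n, d, delta)

exceed = 0
for _ in range(trials):
    G = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]
    if largest_singular_value_two_columns(G) > threshold:
        exceed += 1
exceed_fraction = exceed / trials
print(threshold, tail, exceed_fraction)
```

The observed exceedance frequency is typically far below $\delta$, which reflects that the bound is a guarantee rather than a sharp tail estimate.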
The preceding theorem is Gaussian. For broader ensembles, a more elementary method uses finite nets of the unit sphere and concentration for fixed quadratic forms, so we first define the covering object.
[definition: Epsilon Net]
Let $(T,\rho)$ be a metric space and let $\varepsilon>0$. A subset $N\subset T$ is an $\varepsilon$-net for $T$ if for every $x\in T$ there exists $y\in N$ such that $\rho(x,y)\le \varepsilon$.
[/definition]
Nets reduce an uncountable supremum to a finite maximum at the cost of approximation error. For the unit sphere, the key question is how large such a finite set must be.
[quotetheorem:5926]
[citeproof:5926]
The volumetric estimate is the combinatorial heart of many random matrix arguments. Finite-dimensionality is essential: compactness of $S^{d-1}$ gives finite nets, while in infinite-dimensional unit spheres no comparable finite covering exists in norm. A concrete failure is the unit sphere of $\ell^2$: the vectors $e_1,e_2,\dots$ are pairwise at distance $\sqrt2$, so no finite $1/2$-net can cover the sphere. The condition $\varepsilon>0$ is also structural; an exact $0$-net for $S^{d-1}$ would have to contain every point of the sphere and is infinite when $d\ge 2$. The theorem gives existence and a cardinality bound, not a uniform random matrix estimate by itself. The price is exponential dependence on $d$, so when a union bound is applied over $N$ the probability estimate must beat a factor of order $e^{Cd}$. The next step is to convert a net bound into an operator norm bound for a symmetric matrix.
[quotetheorem:5927]
[citeproof:5927]
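The net argument can be tried out on the circle $S^1$, where an explicit $1/4$-net is given by equally spaced points. The sketch below assumes the theorem has the standard form $\|A\|_{\mathrm{op}}\le 2\max_{u\in N}|u^\top Au|$ for symmetric $A$ and a $1/4$-net $N$; the matrices and helper names are illustrative. It also evaluates the quadratic forms of a skew-symmetric matrix, for which no such bound can hold.

```python
import math

def circle_net(eps):
    """Equally spaced points on S^1 forming an eps-net in Euclidean distance."""
    # With m equally spaced points the worst angular gap to the net is pi / m,
    # and a point at angular distance a lies at chord distance 2 sin(a / 2).
    m = 1
    while 2.0 * math.sin(math.pi / (2.0 * m)) > eps:
        m += 1
    return [(math.cos(2.0 * math.pi * k / m), math.sin(2.0 * math.pi * k / m))
            for k in range(m)]

def quad_form(A, u):
    """u^T A u for a 2 x 2 matrix A given as nested lists."""
    (a, b), (c, d) = A
    return u[0] * (a * u[0] + b * u[1]) + u[1] * (c * u[0] + d * u[1])

def op_norm_sym2(a, b, c):
    """Operator norm of [[a, b], [b, c]]: the largest |eigenvalue|."""
    mean, radius = (a + c) / 2.0, math.hypot((a - c) / 2.0, b)
    return max(abs(mean - radius), abs(mean + radius))

net = circle_net(0.25)
A = [[1.0, 2.0], [2.0, -3.0]]                 # symmetric test matrix
net_max = max(abs(quad_form(A, u)) for u in net)
norm = op_norm_sym2(1.0, 2.0, -3.0)

J = [[0.0, 1.0], [-1.0, 0.0]]                 # skew-symmetric, operator norm 1
skew_max = max(abs(quad_form(J, u)) for u in net)
print(len(net), net_max, norm, skew_max)
```

For the symmetric matrix the net maximum sits between $\|A\|_{\mathrm{op}}/2$ and $\|A\|_{\mathrm{op}}$, while for the skew-symmetric matrix every quadratic form vanishes even though the operator norm is $1$.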
This proof is less sharp than Gaussian comparison, but it is flexible. Symmetry is needed because the operator norm can be written as $\sup_{|u|=1}|u^\top Au|$; for rectangular or non-symmetric matrices, one must instead use bilinear forms $u^\top Av$ or apply the argument to a symmetric dilation. Without symmetry the quadratic forms can vanish while the operator norm is large: take $A\in\mathbb R^{2\times 2}$ with $A_{12}=1$, $A_{21}=-1$, and $A_{11}=A_{22}=0$. Then $u^\top Au=0$ for every $u\in S^1$, but $\|A\|_{\mathrm{op}}=1$. The net must also be fine enough. If $A=\operatorname{diag}(1,0,\dots,0)$ and a candidate net misses a neighbourhood of $e_1$, then $\max_{u\in N}|u^\top Au|$ can be far below $\|A\|_{\mathrm{op}}$; the $1/4$ covering condition is what bounds this approximation loss and allows absorption into the left-hand side. Smaller constants improve only numerical factors, not the method. Net arguments also lose sharp constants and introduce exponential covering factors, so they are strongest when fixed-direction concentration is very strong. Their statistical meaning is visible in the conditioning of tall Gaussian design matrices, where lower and upper singular value bounds combine.
[example: Conditioning of a Tall Gaussian Design]
Let $G\in\mathbb R^{n\times d}$ have independent $\mathcal N(0,1)$ entries, and assume $n>d$. Fix $t\ge 0$. By *Davidson-Szarek Gaussian Singular Value Bounds*,
\begin{align*}
\mathbb P\left(s_1(G)\ge \sqrt n+\sqrt d+t\right)\le e^{-t^2/2}.
\end{align*}
The same theorem gives
\begin{align*}
\mathbb P\left(s_d(G)\le \sqrt n-\sqrt d-t\right)\le e^{-t^2/2}.
\end{align*}
Let
\begin{align*}
E_1=\left\{s_1(G)<\sqrt n+\sqrt d+t\right\}.
\end{align*}
Let
\begin{align*}
E_d=\left\{s_d(G)>\sqrt n-\sqrt d-t\right\}.
\end{align*}
Then
\begin{align*}
\mathbb P(E_1^c)\le e^{-t^2/2}.
\end{align*}
Also,
\begin{align*}
\mathbb P(E_d^c)\le e^{-t^2/2}.
\end{align*}
Taking complements and applying De Morgan's law,
\begin{align*}
\mathbb P(E_1\cap E_d)=1-\mathbb P(E_1^c\cup E_d^c).
\end{align*}
By the union bound $\mathbb P(A\cup B)\le \mathbb P(A)+\mathbb P(B)$,
\begin{align*}
\mathbb P(E_1\cap E_d)\ge 1-\mathbb P(E_1^c)-\mathbb P(E_d^c).
\end{align*}
Therefore
\begin{align*}
\mathbb P(E_1\cap E_d)\ge 1-2e^{-t^2/2}.
\end{align*}
Now suppose $n\ge 16d$ and $t\le \sqrt n/4$. From $n\ge 16d$ we get $d\le n/16$, hence
\begin{align*}
\sqrt d\le \frac{\sqrt n}{4}.
\end{align*}
On the event $E_1\cap E_d$, the lower singular value satisfies
\begin{align*}
s_d(G)>\sqrt n-\sqrt d-t.
\end{align*}
Using $\sqrt d\le \sqrt n/4$ and $t\le \sqrt n/4$,
\begin{align*}
\sqrt n-\sqrt d-t\ge \sqrt n-\frac{\sqrt n}{4}-\frac{\sqrt n}{4}.
\end{align*}
The right-hand side is
\begin{align*}
\sqrt n-\frac{\sqrt n}{4}-\frac{\sqrt n}{4}=\frac{\sqrt n}{2}.
\end{align*}
Thus
\begin{align*}
s_d(G)>\frac{\sqrt n}{2}.
\end{align*}
In particular $s_d(G)>0$, so $G$ has full column rank and its condition number is defined.
On the same event, the upper singular value satisfies
\begin{align*}
s_1(G)<\sqrt n+\sqrt d+t.
\end{align*}
Again using $\sqrt d\le \sqrt n/4$ and $t\le \sqrt n/4$,
\begin{align*}
\sqrt n+\sqrt d+t\le \sqrt n+\frac{\sqrt n}{4}+\frac{\sqrt n}{4}.
\end{align*}
The right-hand side is
\begin{align*}
\sqrt n+\frac{\sqrt n}{4}+\frac{\sqrt n}{4}=\frac{3\sqrt n}{2}.
\end{align*}
Therefore
\begin{align*}
s_1(G)<\frac{3\sqrt n}{2}.
\end{align*}
By the definition of the condition number,
\begin{align*}
\kappa(G)=\frac{s_1(G)}{s_d(G)}.
\end{align*}
Combining the two singular-value bounds gives
\begin{align*}
\kappa(G)<\frac{3\sqrt n/2}{\sqrt n/2}.
\end{align*}
Since
\begin{align*}
\frac{3\sqrt n/2}{\sqrt n/2}=3,
\end{align*}
we have $\kappa(G)<3$ on $E_1\cap E_d$. Thus, with probability at least $1-2e^{-t^2/2}$, a Gaussian design with $n\ge 16d$ and $t\le \sqrt n/4$ distorts Euclidean lengths by a factor less than $3$ between its largest and smallest singular directions.
[/example]
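A small simulation illustrates the conclusion $\kappa(G)<3$ in the regime $n\ge 16d$. As a minimal sketch using only the standard library, it takes $d=2$ so that both singular values come from the $2\times 2$ Gram matrix in closed form; the sizes, seed, and helper name are assumptions made for illustration.

```python
import math
import random

def singular_values_two_columns(G):
    """Singular values s_1 >= s_2 of an n x 2 matrix via its Gram matrix."""
    g11 = sum(row[0] * row[0] for row in G)
    g12 = sum(row[0] * row[1] for row in G)
    g22 = sum(row[1] * row[1] for row in G)
    mean = (g11 + g22) / 2.0
    radius = math.hypot((g11 - g22) / 2.0, g12)
    return math.sqrt(mean + radius), math.sqrt(max(mean - radius, 0.0))

rng = random.Random(2)
d, n, trials = 2, 512, 200            # n >= 16 d holds comfortably
t = math.sqrt(n) / 4.0                # largest t allowed in the example
guarantee = 1.0 - 2.0 * math.exp(-t * t / 2.0)

kappas = []
for _ in range(trials):
    G = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]
    s1, s2 = singular_values_two_columns(G)
    kappas.append(s1 / s2)

fraction_below_3 = sum(k < 3.0 for k in kappas) / trials
print(max(kappas), fraction_below_3, guarantee)
```

Taking $t=\sqrt n/4$ gives the probability guarantee $1-2e^{-n/32}$, so the bound becomes informative only once $n$ is moderately large; the simulated condition numbers are usually much closer to $1$ than to $3$.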
## Matrix Bernstein and Non-Asymptotic Covariance Control
The last problem in the chapter is how to control sums of independent random matrices directly. This is the natural language for sample covariance matrices, empirical Hessians, and noise terms in high-dimensional estimators.
[definition: Matrix Variance Proxy]
Let $n,d\in\mathbb N$, and let $\mathcal Y_{n,d}$ be the class of $n$-tuples of independent centred symmetric random matrices in $\mathbb R^{d\times d}$. The matrix variance proxy is the functional
\begin{align*}
v_{n,d}:\mathcal Y_{n,d}&\to [0,\infty), &
(Y_1,\dots,Y_n)&\mapsto \left\|\sum_{i=1}^n \mathbb E[Y_i^2]\right\|_{\mathrm{op}}.
\end{align*}
[/definition]
For a tuple $(Y_1,\dots,Y_n)\in\mathcal Y_{n,d}$, we write $\sigma^2:=v_{n,d}(Y_1,\dots,Y_n)$. The variance proxy measures the size of the accumulated second moments in the direction where they are largest.
The obstruction is that scalar Bernstein cannot be applied direction by direction without paying for the continuum of unit vectors, and noncommuting summands make entrywise variance an incomplete measure of spectral fluctuation. The needed result must combine this operator-valued variance scale with a uniform bound that prevents one summand from dominating the whole sum.
[quotetheorem:5928]
[citeproof:5928]
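To see how the variance proxy and the uniform bound enter quantitatively, the sketch below evaluates the tail expression $2d\exp\bigl(-\tfrac{t^2/2}{\sigma^2+Lt/3}\bigr)$ and inverts it for a target confidence level by solving a quadratic in $t$. The function names and the numerical values of $d$, $\sigma^2$, $L$, and $\delta$ are illustrative assumptions.

```python
import math

def matrix_bernstein_tail(t, d, sigma2, L):
    """Upper bound 2 d exp(-(t^2 / 2) / (sigma2 + L t / 3)), capped at 1."""
    return min(1.0, 2.0 * d * math.exp(-(t * t / 2.0) / (sigma2 + L * t / 3.0)))

def matrix_bernstein_radius(delta, d, sigma2, L):
    """The t > 0 at which the uncapped bound equals delta."""
    # Setting the bound equal to delta gives t^2 - (2 L ell / 3) t - 2 sigma2 ell = 0
    # with ell = log(2 d / delta); take the positive root.
    ell = math.log(2.0 * d / delta)
    return L * ell / 3.0 + math.sqrt((L * ell / 3.0) ** 2 + 2.0 * sigma2 * ell)

d, delta = 50, 0.01
sigma2, L = 0.0025, 0.00255            # illustrative variance proxy and summand bound
radius = matrix_bernstein_radius(delta, d, sigma2, L)
gaussian_part = math.sqrt(2.0 * sigma2 * math.log(2.0 * d / delta))
print(radius, gaussian_part, matrix_bernstein_tail(radius, d, sigma2, L))
```

The radius splits into a sub-Gaussian part of order $\sigma\sqrt{\log(2d/\delta)}$ and a sub-exponential correction of order $L\log(2d/\delta)$; the dimension enters only through the logarithm.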
Matrix Bernstein is often the workhorse behind finite-sample covariance bounds for bounded or truncated observations. Independence is needed for the trace Laplace-transform argument; if $Y_1=\cdots=Y_n=Z$ for a centred random matrix $Z$, then the sum is $nZ$, whose norm is of order $n$ rather than the order $\sqrt n$ typical of independent summands. Centering is needed so that the first-order drift vanishes; if $Y_i=I_d$ deterministically, then $\|\sum_i Y_i\|_{\mathrm{op}}=n$ with probability one, whereas the Bernstein expression with $\sigma^2=n$ and $L=1$ evaluated at $t=n$ equals $2d\,e^{-3n/8}$, which is smaller than one once $n$ is large. The almost-sure bound $\|Y_i\|_{\mathrm{op}}\le L$ is also structural: if $Y_1$ is a rank-one matrix with a heavy-tailed scalar coefficient, then a single summand can dominate the sum and no Bernstein tail with denominator $\sigma^2+Lt/3$ is available for finite $L$. The prefactor $2d$ records the cost of controlling all spectral directions at once; for diagonal matrices with independent scalar entries, controlling $\max_{1\le j\le d}|S_j|$ already requires a dimension-dependent union cost. The following example shows how a covariance deviation is written as a sum of centred random matrices.
[example: Bounded Isotropic Covariance Estimation]
Let $X_1,\dots,X_n\in\mathbb R^d$ be independent isotropic random vectors satisfying $|X_i|\le R$ almost surely, and set
\begin{align*}
\hat{\Sigma}=\frac{1}{n}\sum_{i=1}^nX_iX_i^\top.
\end{align*}
Define
\begin{align*}
Y_i=\frac{1}{n}(X_iX_i^\top-I_d).
\end{align*}
Since isotropy means $\mathbb E[X_iX_i^\top]=I_d$, each $Y_i$ is centred:
\begin{align*}
\mathbb E[Y_i]=\frac{1}{n}\left(\mathbb E[X_iX_i^\top]-I_d\right)=\frac{1}{n}(I_d-I_d)=0.
\end{align*}
Also,
\begin{align*}
\sum_{i=1}^nY_i=\frac{1}{n}\sum_{i=1}^n(X_iX_i^\top-I_d).
\end{align*}
Expanding the right-hand side gives
\begin{align*}
\frac{1}{n}\sum_{i=1}^n(X_iX_i^\top-I_d)=\frac{1}{n}\sum_{i=1}^nX_iX_i^\top-\frac{1}{n}\sum_{i=1}^nI_d.
\end{align*}
Since $\sum_{i=1}^n I_d=nI_d$, this becomes
\begin{align*}
\sum_{i=1}^nY_i=\hat{\Sigma}-I_d.
\end{align*}
We next verify the boundedness condition needed for *[Matrix Bernstein Inequality](/theorems/5928)*. For any $x\in\mathbb R^d$,
\begin{align*}
\|xx^\top\|_{\mathrm{op}}=|x|^2.
\end{align*}
Indeed, if $|u|=1$, then $xx^\top u=x(x^\top u)$, so by the Cauchy-Schwarz inequality
\begin{align*}
|xx^\top u|=|x||x^\top u|\le |x|^2,
\end{align*}
and equality is attained at $u=x/|x|$ when $x\ne 0$. By homogeneity of the norm,
\begin{align*}
\|Y_i\|_{\mathrm{op}}=\frac{1}{n}\|X_iX_i^\top-I_d\|_{\mathrm{op}}.
\end{align*}
By the triangle inequality for the operator norm, together with $|X_i|\le R$ and $\|I_d\|_{\mathrm{op}}=1$,
\begin{align*}
\|X_iX_i^\top-I_d\|_{\mathrm{op}}\le \|X_iX_i^\top\|_{\mathrm{op}}+\|I_d\|_{\mathrm{op}}\le R^2+1.
\end{align*}
Therefore
\begin{align*}
\|Y_i\|_{\mathrm{op}}\le \frac{R^2+1}{n}.
\end{align*}
Now compute the variance proxy. First,
\begin{align*}
Y_i^2=\frac{1}{n^2}(X_iX_i^\top-I_d)^2.
\end{align*}
Expanding the square gives
\begin{align*}
(X_iX_i^\top-I_d)^2=X_iX_i^\top X_iX_i^\top-2X_iX_i^\top+I_d.
\end{align*}
Since
\begin{align*}
X_iX_i^\top X_iX_i^\top=X_i(X_i^\top X_i)X_i^\top=|X_i|^2X_iX_i^\top,
\end{align*}
we get
\begin{align*}
\mathbb E[Y_i^2]=\frac{1}{n^2}\left(\mathbb E[|X_i|^2X_iX_i^\top]-2\mathbb E[X_iX_i^\top]+I_d\right).
\end{align*}
Using isotropy again, this becomes
\begin{align*}
\mathbb E[Y_i^2]=\frac{1}{n^2}\left(\mathbb E[|X_i|^2X_iX_i^\top]-I_d\right).
\end{align*}
For any unit vector $u\in\mathbb R^d$,
\begin{align*}
u^\top\mathbb E[|X_i|^2X_iX_i^\top]u=\mathbb E[|X_i|^2(u^\top X_i)^2].
\end{align*}
Because $|X_i|^2\le R^2$ almost surely,
\begin{align*}
\mathbb E[|X_i|^2(u^\top X_i)^2]\le R^2\mathbb E[(u^\top X_i)^2].
\end{align*}
Isotropy gives $\mathbb E[(u^\top X_i)^2]=u^\top I_du=1$, so
\begin{align*}
u^\top\mathbb E[Y_i^2]u\le \frac{R^2-1}{n^2}\le \frac{R^2}{n^2}.
\end{align*}
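Since $\mathbb E[Y_i^2]$ is positive semidefinite, its operator norm is the supremum of this quadratic form over unit vectors $u$. Thus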
\begin{align*}
\|\mathbb E[Y_i^2]\|_{\mathrm{op}}\le \frac{R^2}{n^2}.
\end{align*}
Summing over $i$ gives
\begin{align*}
\left\|\sum_{i=1}^n\mathbb E[Y_i^2]\right\|_{\mathrm{op}}\le \sum_{i=1}^n\|\mathbb E[Y_i^2]\|_{\mathrm{op}}\le \frac{R^2}{n}.
\end{align*}
Applying the *Matrix Bernstein Inequality* with
\begin{align*}
L=\frac{R^2+1}{n}
\end{align*}
and
\begin{align*}
\sigma^2\le \frac{R^2}{n}
\end{align*}
gives, for every $s\ge 0$,
\begin{align*}
\mathbb P\left(\|\hat{\Sigma}-I_d\|_{\mathrm{op}}\ge s\right)\le 2d\exp\left(-\frac{s^2/2}{R^2/n+(R^2+1)s/(3n)}\right).
\end{align*}
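Thus the covariance error is controlled by the accumulated matrix variance $R^2/n$ and the single-summand size $(R^2+1)/n$. For small $s$ the variance term dominates, so the leading deviation scale is $R\sqrt{(\log d)/n}$. Since isotropy forces $R^2\ge \mathbb E|X_i|^2=d$, this scale is at least $\sqrt{(d\log d)/n}$.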
[/example]
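The bound in the example can be evaluated numerically. The following is a minimal sketch, not part of the original notes: the function names and the parameter values are hypothetical, and the choice $R=\sqrt d$ is only an illustration of the smallest radius compatible with isotropy. The second function inverts the bound by solving the quadratic $s^2/2=u(\sigma^2+Ls/3)$ with $u=\log(2d/\delta)$.

```python
import math


def bernstein_tail(s, n, d, R):
    """Right-hand side of the displayed bound on P(||Sigma_hat - I_d||_op >= s)."""
    sigma2 = R**2 / n        # variance proxy: sigma^2 <= R^2 / n
    L = (R**2 + 1) / n       # almost-sure bound on ||Y_i||_op
    return 2 * d * math.exp(-(s**2 / 2) / (sigma2 + L * s / 3))


def deviation_at_level(delta, n, d, R):
    """Value of s at which the bound equals delta (requires 0 < delta < 2d)."""
    sigma2 = R**2 / n
    L = (R**2 + 1) / n
    u = math.log(2 * d / delta)
    # Positive root of s^2 - (2 L u / 3) s - 2 sigma2 u = 0.
    half_b = L * u / 3
    return half_b + math.sqrt(half_b**2 + 2 * sigma2 * u)


# Hypothetical illustration: n samples in dimension d with radius R = sqrt(d).
n, d = 20000, 50
R = math.sqrt(d)
s = deviation_at_level(0.01, n, d, R)
print(s, bernstein_tail(s, n, d, R))  # the second number should be close to 0.01
```

Solving for $s$ at a fixed failure probability makes the two regimes visible: the square-root term carries the variance proxy, while the additive term `half_b` carries the single-summand bound $L$.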
Finite-sample inequalities give high-probability bounds at fixed dimensions. This motivates the following theorem, which identifies the sharp limiting edge of large rectangular matrices. It also separates what non-asymptotic concentration can guarantee at finite $n$ from the almost sure constants that emerge when $n$ and $d$ grow together.
[quotetheorem:5929]
In these notes we use the theorem as a benchmark showing that the non-asymptotic Gaussian bounds have the correct leading constants. Its statistical message is that the proportional regime has deterministic edge locations, so covariance and PCA procedures must compare their signals with those edge scales rather than with the fixed-dimensional limit.
[remark: Bulk Versus Edge]
The empirical spectral distribution of $n^{-1}X_n^\top X_n$ describes the bulk eigenvalue distribution, while the Bai-Yin theorem describes the limiting edges. In statistical applications, both matter: the bulk controls average spectral functionals, whereas the lower edge controls invertibility and the upper edge controls worst-case amplification.
[/remark]
These preliminaries set up the random matrix inputs for the rest of the course. Minimax lower bounds will use random matrices to build hard instances and compare distributions, while upper bounds will rely on concentration of covariance matrices, restricted operator norms, and conditioning of random designs.
We now apply these tools to sample covariance matrices, where the Marchenko-Pastur law explains the limiting spectral shape that underlies much of modern high-dimensional inference.
# 7. Marchenko-Pastur Law and Sample Covariance Matrices
Chapters 1 through 5 developed minimax lower bounds, and Chapter 6 introduced the spectral and concentration tools for random matrices. This chapter assumes the basic spectral theorem for symmetric matrices, weak convergence of probability measures, and standard concentration tools for sums and quadratic forms. We now turn to the random matrix mechanism behind many of those statistical rates: even when the population covariance is the identity, the empirical covariance matrix has a non-degenerate spectrum once the dimension and sample size are comparable. The central question is how the eigenvalues of $X^\top X/n$ behave when $d/n \to \gamma$, and what this says about covariance estimation, principal components, and regularisation.
## Empirical Spectrum in the Proportional Limit
What should replace the law of large numbers for a covariance matrix whose dimension grows with the sample size? In fixed dimension, $X^\top X/n$ converges entrywise to the population covariance, and hence its eigenvalues converge to the population eigenvalues. In the proportional regime $d/n \to \gamma \in (0,\infty)$, entrywise convergence no longer controls the whole spectrum, because there are $d$ eigenvalues moving at once.
We first encode the spectrum as a probability measure, since this makes convergence of all eigenvalues into convergence of measures.
[definition: Empirical Spectral Distribution]
Let $\mathcal S_d$ be the set of symmetric matrices in $\mathbb R^{d\times d}$. The empirical spectral distribution map is the function
\begin{align*}
\operatorname{ESD}_d: \mathcal S_d \to \mathcal P(\mathbb R).
\end{align*}
It sends a matrix $A_d$ to
\begin{align*}
\mu_{A_d}:=\frac{1}{d}\sum_{j=1}^d \delta_{\lambda_j(A_d)},
\end{align*}
where $\lambda_1(A_d),\dots,\lambda_d(A_d)$ are the eigenvalues of $A_d$ counted with algebraic multiplicity and $\mathcal P(\mathbb R)$ denotes the set of probability measures on $\mathbb R$.
[/definition]
The empirical spectral distribution records the bulk of the eigenvalues, not the identity of a particular eigenvector. This is the right object for sample covariance matrices because the bulk does not collapse to a point in high dimension.
[example: Fixed Dimension Versus Proportional Dimension]
Let $X\in\mathbb R^{n\times d}$ have i.i.d. entries $X_{ij}$ with mean zero and variance one, and let
\begin{align*}
A_n:=\frac{X^\top X}{n}.
\end{align*}
For fixed $d$, the $(j,k)$ entry of $A_n$ is
\begin{align*}
(A_n)_{jk}
=\frac{1}{n}\sum_{i=1}^n X_{ij}X_{ik}.
\end{align*}
If $j=k$, then $\mathbb E[X_{ij}^2]=1$, so $(A_n)_{jj}\to 1$ by the *[strong law of large numbers](/theorems/1852)*. If $j\ne k$, independence and mean zero give
\begin{align*}
\mathbb E[X_{ij}X_{ik}]
=\mathbb E[X_{ij}]\mathbb E[X_{ik}]
=0\cdot 0
=0,
\end{align*}
so $(A_n)_{jk}\to 0$ by the same law. Hence $A_n\to I_d$ entrywise. Since $d$ is fixed,
\begin{align*}
\|A_n-I_d\|_{\mathrm{op}}
\le \|A_n-I_d\|_{\mathrm F}
=\left(\sum_{j=1}^d\sum_{k=1}^d |(A_n-I_d)_{jk}|^2\right)^{1/2}
\to 0.
\end{align*}
For every eigenvalue $\lambda_j(A_n)$, the Rayleigh quotient bound gives
\begin{align*}
|\lambda_j(A_n)-1|\le \|A_n-I_d\|_{\mathrm{op}},
\end{align*}
so $\lambda_j(A_n)\to 1$ for each fixed $j$. Therefore, for every bounded continuous $\varphi$,
\begin{align*}
\int \varphi\,d\mu_{A_n}
=\frac{1}{d}\sum_{j=1}^d \varphi(\lambda_j(A_n))
\to \frac{1}{d}\sum_{j=1}^d \varphi(1)
=\varphi(1)
=\int \varphi\,d\delta_1,
\end{align*}
which is exactly $\mu_{A_n}\Rightarrow \delta_1$.
In contrast, when $d/n\to\gamma>0$, the *Marchenko-Pastur Law* gives $\mu_{A_n}\Rightarrow \mu_\gamma$, whose continuous part is supported on
\begin{align*}
[(1-\sqrt{\gamma})^2,(1+\sqrt{\gamma})^2].
\end{align*}
The length of this interval is
\begin{align*}
(1+\sqrt{\gamma})^2-(1-\sqrt{\gamma})^2
=(1+2\sqrt{\gamma}+\gamma)-(1-2\sqrt{\gamma}+\gamma)
=4\sqrt{\gamma}>0.
\end{align*}
Thus the fixed-dimensional limit collapses the spectrum to the point $1$, while the proportional-dimensional limit keeps a non-degenerate spectral bulk even though each matrix entry is still an average of well-behaved random variables.
[/example]
The example gives the qualitative phenomenon: high-dimensional sampling turns a point mass at $1$ into a whole spectral bulk. To use this in statistics, we need the exact limiting measure, including its support endpoints and any mass at zero. The following definition names the candidate law that will be shown to govern identity-covariance sample matrices.
[definition: Marchenko-Pastur Distribution]
Let $\gamma \in (0,\infty)$ and define
\begin{align*}
a_\gamma := (1-\sqrt{\gamma})^2, \qquad b_\gamma := (1+\sqrt{\gamma})^2.
\end{align*}
The Marchenko-Pastur distribution with aspect ratio $\gamma$ is the probability measure $\mu_\gamma \in \mathcal P([0,\infty))$ defined by
\begin{align*}
\mu_\gamma(B)=\int_B f_\gamma(x)\,d\mathcal L^1(x)+\left(1-\frac{1}{\gamma}\right)\mathbb{1}_{(1,\infty)}(\gamma)\delta_0(B)
\end{align*}
for every Borel set $B\subset [0,\infty)$, where
\begin{align*}
f_\gamma: [0,\infty) \to [0,\infty).
\end{align*}
We set $f_\gamma(0)=0$, and for $x>0$ this density is
\begin{align*}
f_\gamma(x)=\frac{1}{2\pi \gamma x}\sqrt{(b_\gamma-x)(x-a_\gamma)}\,\mathbb{1}_{[a_\gamma,b_\gamma]}(x).
\end{align*}
[/definition]
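As a sanity check on the definition, the following sketch evaluates the endpoints, the density, and the atom, and verifies numerically that the total mass is close to $1$. The function names, the midpoint quadrature rule, and the step count are illustrative assumptions, not part of the notes.

```python
import math


def mp_edges(gamma):
    """Support endpoints a_gamma and b_gamma of the continuous part."""
    root = math.sqrt(gamma)
    return (1 - root) ** 2, (1 + root) ** 2


def mp_density(x, gamma):
    """Density f_gamma(x) of the continuous part (zero outside [a, b] and at x = 0)."""
    a, b = mp_edges(gamma)
    if x <= 0 or x < a or x > b:
        return 0.0
    return math.sqrt((b - x) * (x - a)) / (2 * math.pi * gamma * x)


def mp_atom_at_zero(gamma):
    """Mass of the atom at zero: 1 - 1/gamma when gamma > 1, otherwise 0."""
    return 1 - 1 / gamma if gamma > 1 else 0.0


def mp_continuous_mass(gamma, steps=100_000):
    """Midpoint-rule approximation of the integral of f_gamma over [a, b]."""
    a, b = mp_edges(gamma)
    h = (b - a) / steps
    return h * sum(mp_density(a + (k + 0.5) * h, gamma) for k in range(steps))


for gamma in (0.25, 2.0):
    total = mp_continuous_mass(gamma) + mp_atom_at_zero(gamma)
    print(gamma, mp_edges(gamma), total)  # total should be close to 1
```

For $\gamma=2$ the continuous part carries only about half of the mass, and the remainder sits in the atom at zero.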
The atom at zero is not an analytic accident; it is the rank defect of a $d \times d$ matrix formed from only $n$ samples. The continuous part describes the nonzero eigenvalues after this rank constraint is accounted for. What remains unclear from the definition alone is whether the same law survives random fluctuations in all entries simultaneously, rather than merely fitting a rank count and a plausible support. The next theorem resolves this obstruction by asserting that this explicitly defined measure is not merely a candidate, but the universal spectral limit for a broad class of i.i.d. data matrices.
[quotetheorem:4070]
[citeproof:4070]
The hypotheses isolate the regime in which a universal covariance-spectrum law is expected. The mean-zero assumption removes a rank-one signal from the sample matrix; without it, the empirical covariance contains a deterministic spike coming from the mean. The variance-one assumption fixes the scale, since variance $\sigma^2$ would multiply the limiting spectrum by $\sigma^2$, and the finite fourth moment prevents rare entries from dominating the trace moments used in the proof. The theorem does not say that every eigenvalue is close to $1$, nor does weak convergence of the empirical spectral distribution by itself control the largest eigenvalue. It says instead that the bulk fills $[(1-\sqrt{\gamma})^2,(1+\sqrt{\gamma})^2]$, which motivates the separate edge theorem below and explains why covariance estimation needs more than entrywise consistency.
[example: Singularity When Dimension Exceeds Sample Size]
Let $d>n$ and let $X\in\mathbb R^{n\times d}$. The linear map represented by $X$ has domain dimension $d$ and codomain dimension $n$, so its rank is at most $n$:
\begin{align*}
\operatorname{rank}(X)\le n.
\end{align*}
Also,
\begin{align*}
\operatorname{range}(X^\top X)\subseteq \operatorname{range}(X^\top),
\end{align*}
because every vector of the form $X^\top Xv$ is $X^\top w$ with $w=Xv$. Therefore
\begin{align*}
\operatorname{rank}(X^\top X)
=\dim\operatorname{range}(X^\top X)
\le \dim\operatorname{range}(X^\top)
=\operatorname{rank}(X^\top)
=\operatorname{rank}(X)
\le n.
\end{align*}
Multiplying a matrix by the nonzero scalar $1/n$ does not change its rank, so
\begin{align*}
\operatorname{rank}\left(\frac{X^\top X}{n}\right)
=\operatorname{rank}(X^\top X)
\le n.
\end{align*}
The matrix $X^\top X/n$ is a $d\times d$ symmetric matrix. Since its rank is the number of nonzero eigenvalues counted with multiplicity, it has at most $n$ nonzero eigenvalues and hence at least $d-n$ zero eigenvalues. By the definition of the empirical spectral distribution,
\begin{align*}
\mu_{X^\top X/n}(\{0\})
\ge \frac{d-n}{d}
=\frac{d}{d}-\frac{n}{d}
=1-\frac{n}{d}.
\end{align*}
If $d/n\to\gamma>1$, then
\begin{align*}
\frac{n}{d}
=\left(\frac{d}{n}\right)^{-1}
\to \frac{1}{\gamma},
\end{align*}
and therefore
\begin{align*}
1-\frac{n}{d}\to 1-\frac{1}{\gamma}.
\end{align*}
Thus the zero eigenvalues forced by the deterministic rank constraint produce exactly the limiting atom at zero that appears in the Marchenko-Pastur distribution when $\gamma>1$.
[/example]
The support endpoints already suggest a statistical warning: even under the identity covariance model, the largest empirical eigenvalue is biased upward and the smallest nonzero eigenvalue is biased downward. The next section explains where the density and support arise through the Stieltjes transform.
## Stieltjes Transform Derivation
How can a limiting eigenvalue density be derived without explicitly counting moments of every order? The Stieltjes transform converts a probability measure on the real line into an analytic function on the upper half-plane, and the resolvent identity converts spectral questions into matrix inverse questions.
[definition: Stieltjes Transform]
Let $\mu$ be a probability measure on $\mathbb R$. Its Stieltjes transform is the function $m_\mu: \mathbb C\setminus \mathbb R \to \mathbb C$ defined by
\begin{align*}
m_\mu(z) := \int_\mathbb R \frac{1}{x-z}\,d\mu(x).
\end{align*}
[/definition]
The Stieltjes transform packages a measure into a function whose boundary values recover the measure. For empirical spectral distributions, this transform has a concrete matrix form. That form is needed because random matrix proofs estimate resolvents directly rather than tracking individual eigenvalues.
[quotetheorem:5930]
[citeproof:5930]
The resolvent formula turns the spectral problem into the problem of estimating a random normalised trace. Each hypothesis has a specific role, and the role is already tuned to the leave-one-out argument that follows. Symmetry gives an orthogonal diagonalisation with real eigenvalues, so $\mu_{A_d}$ is a probability measure on $\mathbb R$ and the trace of the resolvent is the average of scalar resolvents. It also gives the stability needed when a row or column is removed from a sample covariance matrix. If symmetry is dropped, the statement can fail before the proof begins: the real linear map $A:\mathbb R^2\to\mathbb R^2$ with $A(e_1)=e_2$ and $A(e_2)=-e_1$ has eigenvalues $i$ and $-i$, so there is no empirical spectral distribution as a probability measure on $\mathbb R$ to which the displayed real-line Stieltjes transform applies. If the theorem were rewritten over complex spectral measures, additional non-Hermitian tools would be needed because eigenvalues of non-normal matrices can be highly unstable under perturbation, while Hermitian resolvent bounds control the distance to the real spectrum. The assumption $z\notin\mathbb R$ is also needed: if $z$ equals an eigenvalue, $A_d-zI_d$ is not invertible; if $z$ is real but outside the spectrum, the formula still makes algebraic sense, but it loses the uniform stability bound
\begin{align*}
\|(A_d-zI_d)^{-1}\|_{\mathrm{op}}\le |\operatorname{Im}z|^{-1}
\end{align*}
that makes random resolvent limits robust. The theorem is therefore an identity for Hermitian spectral analysis, not a replacement for pseudospectral tools for general matrices, and its real advantage is that low-rank resolvent identities remain stable enough to pass to a deterministic limit.
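The identity between the Stieltjes transform of the empirical spectral distribution and the normalised trace of the resolvent can be checked by hand in dimension two. The sketch below is an illustration under stated assumptions: the $2\times 2$ test matrix and the point $z$ are hypothetical choices, and the closed-form eigenvalues and inverse are used only to avoid any linear algebra library.

```python
import math


def sym2_eigenvalues(a, b, c):
    """Eigenvalues of the symmetric matrix [[a, b], [b, c]] in closed form."""
    mid = (a + c) / 2
    rad = math.hypot((a - c) / 2, b)
    return mid - rad, mid + rad


def stieltjes_from_eigenvalues(eigs, z):
    """m(z) = (1/d) * sum_j 1 / (lambda_j - z) for the empirical spectral distribution."""
    return sum(1 / (lam - z) for lam in eigs) / len(eigs)


def stieltjes_from_resolvent(a, b, c, z):
    """(1/2) * tr((A - z I)^{-1}) using the explicit inverse of a 2x2 matrix."""
    det = (a - z) * (c - z) - b * b
    # The inverse of [[p, q], [q, r]] has diagonal entries r/det and p/det.
    return ((c - z) / det + (a - z) / det) / 2


a, b, c = 2.0, 0.5, 1.0   # hypothetical symmetric test matrix
z = 0.3 + 0.2j             # any point off the real axis
m_eig = stieltjes_from_eigenvalues(sym2_eigenvalues(a, b, c), z)
m_res = stieltjes_from_resolvent(a, b, c, z)
print(abs(m_eig - m_res))               # agreement up to rounding
print(abs(m_eig) <= 1 / abs(z.imag))    # consistent with the resolvent norm bound
```

The last line reflects the stability bound: averaging scalar resolvents cannot exceed $|\operatorname{Im}z|^{-1}$ in modulus.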
The obstruction is that the entries of $(X^\top X/n-zI_d)^{-1}$ are highly dependent functions of the whole matrix, so averaging eigenvalues directly does not expose a closed limit. In the proportional regime, a self-consistent equation appears because removing one row or one column changes the resolvent by a low-rank update. The following equation is the analytic signature of the Marchenko-Pastur law.
[quotetheorem:5931]
[citeproof:5931]
The self-consistent equation contains the whole limiting distribution only after the analytic selection rules are included. The quadratic equation alone has two branches; using the wrong branch gives a function with $z m(z)$ tending to a value different from $-1$, so it cannot be the Stieltjes transform of a probability measure because every probability measure satisfies $z m_\mu(z)\to -1$ as $|z|\to\infty$ away from the real line. The sign condition is also necessary: a Stieltjes transform maps the upper half-plane to the upper half-plane under the convention $m_\mu(z)=\int (x-z)^{-1}\,d\mu(x)$, so the branch with the opposite imaginary sign cannot correspond to a positive measure. The restriction $z\in\mathbb C\setminus\mathbb R$ is not cosmetic; on the real axis the transform can have singular boundary behaviour, and at points of the support it is not defined by the integral as an ordinary finite real-valued function. The theorem does not by itself prove convergence of random matrices or identify a density; it identifies the analytic equation and the probability-measure branch that the Marchenko-Pastur transform satisfies.
The missing step is to recover the measure from the analytic transform and identify where the boundary values acquire imaginary part. The quadratic equation alone still does not tell the reader which real interval carries continuous spectral mass, whether an atom remains at zero, or how the total mass is distributed. Stieltjes inversion resolves these ambiguities by reading the density from the boundary values of the selected analytic branch.
[quotetheorem:5932]
[citeproof:5932]
This transform calculation is more than an alternative proof. It is the template for later spiked covariance models, where isolated eigenvalues are detected by perturbing a resolvent equation rather than by recomputing every moment. The assumptions in the theorem prevent three different failures. If the function solving the quadratic is not required to be a Stieltjes transform of a probability measure, the other square-root branch is also an algebraic solution but has the wrong behaviour at infinity. If the sign condition is omitted, the boundary value can have the wrong imaginary sign, producing a negative density through Stieltjes inversion. If the normalisation at infinity is omitted, the total mass is not fixed; for instance, multiplying a Stieltjes transform by a constant changes the mass encoded by its asymptotic behaviour, even though local boundary computations may still resemble density calculations.
The theorem identifies the absolutely continuous density on the interval where the boundary value has imaginary part. It does not say that all mass is absolutely continuous: when $\gamma>1$, a separate atom at zero is detected by $-\lim_{\eta\downarrow 0} i\eta m_\gamma(i\eta)$. It also does not give finite-sample eigenvalue locations or edge convergence; those require probabilistic estimates beyond analytic inversion.
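The branch selection and the inversion step can be illustrated numerically. The sketch below assumes that the self-consistent equation takes the standard form $\gamma z m^2+(z+\gamma-1)m+1=0$; the quoted theorem may use a different but equivalent normalisation. The function names and the value of $\eta$ are hypothetical. The code picks the root with positive imaginary part, reads the density from $\operatorname{Im}m(x+i\eta)/\pi$, and reads the atom from $-i\eta\,m(i\eta)$.

```python
import cmath
import math


def mp_stieltjes(z, gamma):
    """Root of gamma*z*m^2 + (z + gamma - 1)*m + 1 = 0 with the larger imaginary part."""
    lin = z + gamma - 1
    disc = cmath.sqrt(lin * lin - 4 * gamma * z)
    roots = ((-lin + disc) / (2 * gamma * z), (-lin - disc) / (2 * gamma * z))
    # Branch selection: for Im(z) > 0 the Stieltjes transform has Im(m) > 0.
    return max(roots, key=lambda m: m.imag)


def density_by_inversion(x, gamma, eta=1e-6):
    """Approximate the density at x by Im m(x + i*eta) / pi."""
    return mp_stieltjes(complex(x, eta), gamma).imag / math.pi


def atom_at_zero_by_inversion(gamma, eta=1e-6):
    """Approximate the atom at zero by the real part of -i*eta*m(i*eta)."""
    z = complex(0.0, eta)
    return (-z * mp_stieltjes(z, gamma)).real


def mp_density(x, gamma):
    """Closed-form density of the continuous part at x > 0, for comparison."""
    a, b = (1 - math.sqrt(gamma)) ** 2, (1 + math.sqrt(gamma)) ** 2
    return math.sqrt(max((b - x) * (x - a), 0.0)) / (2 * math.pi * gamma * x)


print(density_by_inversion(1.0, 0.25), mp_density(1.0, 0.25))  # should nearly agree
print(atom_at_zero_by_inversion(2.0))   # should be close to 1 - 1/2
print(atom_at_zero_by_inversion(0.25))  # should be close to 0
```

Selecting the root by the sign of its imaginary part is the numerical counterpart of the sign condition: the discarded root is an algebraic solution but does not map the upper half-plane to itself.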
[example: Reading the Support from the Quadratic Formula]
The [quadratic formula](/theorems/1301) for the Marchenko-Pastur Stieltjes equation produces the square-root factor
\begin{align*}
\sqrt{(z-a_\gamma)(z-b_\gamma)},
\end{align*}
where
\begin{align*}
a_\gamma=(1-\sqrt{\gamma})^2,
\qquad
b_\gamma=(1+\sqrt{\gamma})^2.
\end{align*}
For $\gamma=1/4$, we have
\begin{align*}
\sqrt{\gamma}
=\sqrt{\frac{1}{4}}
=\frac{1}{2}.
\end{align*}
Therefore the left endpoint is
\begin{align*}
a_{1/4}
=\left(1-\frac{1}{2}\right)^2
=\left(\frac{1}{2}\right)^2
=\frac{1}{4},
\end{align*}
and the right endpoint is
\begin{align*}
b_{1/4}
=\left(1+\frac{1}{2}\right)^2
=\left(\frac{3}{2}\right)^2
=\frac{9}{4}.
\end{align*}
Thus the limiting bulk is
\begin{align*}
[a_{1/4},b_{1/4}]
=\left[\frac{1}{4},\frac{9}{4}\right],
\end{align*}
so identity covariance can produce empirical eigenvalues near $0.25$ and near $2.25$ in the proportional regime. The spread comes from high-dimensional sampling itself, not from anisotropy in the population covariance.
[/example]
## Spectral Edges and Covariance Estimation
The empirical spectral distribution controls the bulk, but covariance estimation in operator norm is governed by the extreme eigenvalues. The next question is whether the endpoints of the Marchenko-Pastur support are only bulk endpoints or actual limits of the smallest and largest eigenvalues.
[quotetheorem:5929]
[citeproof:5929]
The theorem upgrades a statement about the average spectral distribution into a statement about operator norm error, and each assumption is visible at the edge. The finite fourth moment is close to the natural threshold for this conclusion: if the entries have heavier tails, a few unusually large entries or rows can create singular values outside the Marchenko-Pastur edge. The mean-zero assumption excludes a deterministic rank-one component; for entries $X_{ij}=Y_{ij}+m$ with $m\ne0$ and centred noise $Y_{ij}$, the matrix $X^\top X/n$ contains a rank-one contribution of order $m^2 d_n$, creating an outlier far beyond $(1+\sqrt\gamma)^2$. The variance-one assumption fixes the edge scale; if the entries have variance $\sigma^2$, then the limits become $\sigma^2(1\pm\sqrt\gamma)^2$ rather than the displayed values. The aspect-ratio assumption is also necessary: if $d_n/n\to0$, the upper edge tends to $1$ and the proportional-regime formula degenerates, while if $d_n/n$ has two different subsequential limits, the extreme eigenvalues have corresponding different subsequential limits rather than a single number.
The theorem does not give the fluctuation scale around the edge, does not describe eigenvectors, and does not handle correlated entries or non-identity population covariance without further hypotheses. Under the identity covariance model,
\begin{align*}
\left\|\frac{X^\top X}{n}-I_d\right\|_{\mathrm{op}}
\end{align*}
does not converge to zero when $d/n\to\gamma>0$; its limiting scale is determined by the larger of the upper-edge excess and the lower-edge deficit.
[example: Operator Norm Error Under Identity Covariance]
Assume $d/n\to\gamma\in(0,1)$ and set
\begin{align*}
A_n:=\frac{X^\top X}{n}.
\end{align*}
By the *Bai–Yin Theorem*,
\begin{align*}
\lambda_{\min}(A_n)\to (1-\sqrt\gamma)^2
\end{align*}
almost surely, and
\begin{align*}
\lambda_{\max}(A_n)\to (1+\sqrt\gamma)^2
\end{align*}
almost surely. Since $A_n$ is symmetric, the eigenvalues of $A_n-I_d$ are $\lambda_j(A_n)-1$, so
\begin{align*}
\|A_n-I_d\|_{\mathrm{op}}
=\max_{1\le j\le d}|\lambda_j(A_n)-1|.
\end{align*}
Because $\gamma\in(0,1)$, the limiting lower edge $(1-\sqrt\gamma)^2$ is below $1$ and the limiting upper edge $(1+\sqrt\gamma)^2$ is above $1$. Hence
\begin{align*}
\|A_n-I_d\|_{\mathrm{op}}
=\max\{1-\lambda_{\min}(A_n),\,\lambda_{\max}(A_n)-1\}.
\end{align*}
Taking limits gives
\begin{align*}
\|A_n-I_d\|_{\mathrm{op}}
\to
\max\left\{1-(1-\sqrt\gamma)^2,\,(1+\sqrt\gamma)^2-1\right\}.
\end{align*}
The lower-edge deficit is
\begin{align*}
1-(1-\sqrt\gamma)^2
=1-(1-2\sqrt\gamma+\gamma)
=2\sqrt\gamma-\gamma.
\end{align*}
The upper-edge excess is
\begin{align*}
(1+\sqrt\gamma)^2-1
=(1+2\sqrt\gamma+\gamma)-1
=2\sqrt\gamma+\gamma.
\end{align*}
Since $\gamma>0$,
\begin{align*}
(2\sqrt\gamma+\gamma)-(2\sqrt\gamma-\gamma)=2\gamma>0,
\end{align*}
so the larger term is $2\sqrt\gamma+\gamma$. Therefore
\begin{align*}
\left\|\frac{X^\top X}{n}-I_d\right\|_{\mathrm{op}}
\to 2\sqrt\gamma+\gamma>0.
\end{align*}
Thus even when the population covariance is $I_d$, proportional-dimensional sampling leaves a non-vanishing operator norm error; entrywise consistency does not control the whole spectrum when the number of entries grows with $n$.
[/example]
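The arithmetic of the example is easy to tabulate. The following sketch, with a hypothetical function name, computes both edge gaps and returns the larger one, which the example shows to be $2\sqrt\gamma+\gamma$ for $0<\gamma<1$.

```python
import math


def limiting_operator_error(gamma):
    """Limit of ||X^T X / n - I_d||_op for 0 < gamma < 1, from the two edge limits."""
    if not 0 < gamma < 1:
        raise ValueError("the example assumes 0 < gamma < 1")
    lower_deficit = 1 - (1 - math.sqrt(gamma)) ** 2   # equals 2*sqrt(gamma) - gamma
    upper_excess = (1 + math.sqrt(gamma)) ** 2 - 1    # equals 2*sqrt(gamma) + gamma
    return max(lower_deficit, upper_excess)


for gamma in (0.01, 0.25, 0.5):
    print(gamma, limiting_operator_error(gamma))
```

Even at $\gamma=0.01$, that is one hundred samples per dimension, the limiting error is about $0.21$, which shows how slowly the $2\sqrt\gamma$ term decays.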
The same distortion motivates shrinkage estimators. If the empirical eigenvalues of an identity covariance matrix are already spread out, then observed large eigenvalues tend to overstate population signal and observed small eigenvalues tend to understate it.
[example: Shrinkage Intuition from Distorted Eigenvalues]
Suppose the population covariance is $I_d$ and $d/n\approx 1/2$. By the *Marchenko-Pastur Law*, the limiting bulk has lower endpoint
\begin{align*}
a_{1/2}=\left(1-\sqrt{\frac{1}{2}}\right)^2.
\end{align*}
Since $\sqrt{1/2}=1/\sqrt{2}$, we have
\begin{align*}
a_{1/2}=\left(1-\frac{1}{\sqrt{2}}\right)^2.
\end{align*}
Expanding the square gives
\begin{align*}
a_{1/2}=1-\frac{2}{\sqrt{2}}+\frac{1}{2}.
\end{align*}
Because $2/\sqrt{2}=\sqrt{2}$, this is
\begin{align*}
a_{1/2}=\frac{3}{2}-\sqrt{2}\approx 0.086.
\end{align*}
Similarly,
\begin{align*}
b_{1/2}=\left(1+\sqrt{\frac{1}{2}}\right)^2.
\end{align*}
Using $\sqrt{1/2}=1/\sqrt{2}$ again,
\begin{align*}
b_{1/2}=\left(1+\frac{1}{\sqrt{2}}\right)^2.
\end{align*}
Expanding gives
\begin{align*}
b_{1/2}=1+\frac{2}{\sqrt{2}}+\frac{1}{2}.
\end{align*}
Thus
\begin{align*}
b_{1/2}=\frac{3}{2}+\sqrt{2}\approx 2.914.
\end{align*}
So an empirical eigenvalue near $2.5$ lies inside the noise bulk $[0.086,2.914]$, and therefore it need not indicate a population eigenvalue larger than $1$.
A linear shrinkage rule replaces each empirical eigenvalue $\hat\lambda_j$ by
\begin{align*}
\tilde\lambda_j=(1-\alpha)\hat\lambda_j+\alpha
\end{align*}
with $0<\alpha<1$. Subtracting the identity-covariance target $1$ gives
\begin{align*}
\tilde\lambda_j-1=(1-\alpha)\hat\lambda_j+\alpha-1.
\end{align*}
Since $\alpha-1=-(1-\alpha)$, this becomes
\begin{align*}
\tilde\lambda_j-1=(1-\alpha)\hat\lambda_j-(1-\alpha).
\end{align*}
Factoring out $1-\alpha$ gives
\begin{align*}
\tilde\lambda_j-1=(1-\alpha)(\hat\lambda_j-1).
\end{align*}
Because $0<1-\alpha<1$, shrinkage multiplies the distance from $1$ by the factor $1-\alpha$.
If $\hat\lambda_j>1$, then $\alpha(\hat\lambda_j-1)>0$, and
\begin{align*}
\tilde\lambda_j=\hat\lambda_j-\alpha(\hat\lambda_j-1).
\end{align*}
Hence $\tilde\lambda_j<\hat\lambda_j$, so an inflated empirical eigenvalue is pulled downward. If $\hat\lambda_j<1$, then $1-\hat\lambda_j>0$, and
\begin{align*}
\tilde\lambda_j-\hat\lambda_j=\alpha(1-\hat\lambda_j)>0.
\end{align*}
Hence $\tilde\lambda_j>\hat\lambda_j$, so a deflated empirical eigenvalue is pulled upward.
If the empirical covariance has spectral decomposition
\begin{align*}
\widehat\Sigma=Q\operatorname{diag}(\hat\lambda_1,\dots,\hat\lambda_d)Q^\top,
\end{align*}
then the shrinkage estimator obtained by changing only the eigenvalues is
\begin{align*}
\widetilde\Sigma=Q\operatorname{diag}(\tilde\lambda_1,\dots,\tilde\lambda_d)Q^\top.
\end{align*}
Substituting $\tilde\lambda_j=(1-\alpha)\hat\lambda_j+\alpha$ gives
\begin{align*}
\widetilde\Sigma=Q\operatorname{diag}((1-\alpha)\hat\lambda_1+\alpha,\dots,(1-\alpha)\hat\lambda_d+\alpha)Q^\top.
\end{align*}
Using linearity of the diagonal map,
\begin{align*}
\operatorname{diag}((1-\alpha)\hat\lambda_1+\alpha,\dots,(1-\alpha)\hat\lambda_d+\alpha)=(1-\alpha)\operatorname{diag}(\hat\lambda_1,\dots,\hat\lambda_d)+\alpha I_d.
\end{align*}
Therefore
\begin{align*}
\widetilde\Sigma=Q\bigl((1-\alpha)\operatorname{diag}(\hat\lambda_1,\dots,\hat\lambda_d)+\alpha I_d\bigr)Q^\top.
\end{align*}
Distributing the multiplication by $Q$ and $Q^\top$ gives
\begin{align*}
\widetilde\Sigma=(1-\alpha)Q\operatorname{diag}(\hat\lambda_1,\dots,\hat\lambda_d)Q^\top+\alpha QI_dQ^\top.
\end{align*}
Since $QI_dQ^\top=QQ^\top=I_d$, this reduces to
\begin{align*}
\widetilde\Sigma=(1-\alpha)\widehat\Sigma+\alpha I_d.
\end{align*}
Thus the eigenvectors are still the columns of $Q$, while the eigenvalues are moved toward the identity-covariance target $1$.
[/example]
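The equivalence between shrinking eigenvalues and forming the convex combination $(1-\alpha)\widehat\Sigma+\alpha I_d$ can be checked on a small case. The sketch below uses a hypothetical $2\times 2$ sample covariance and a hypothetical weight $\alpha$, with closed-form eigenvalues so that no linear algebra library is needed.

```python
import math


def sym2_eigenvalues(a, b, c):
    """Eigenvalues of the symmetric matrix [[a, b], [b, c]], in increasing order."""
    mid = (a + c) / 2
    rad = math.hypot((a - c) / 2, b)
    return mid - rad, mid + rad


def shrink_eigenvalue(lam, alpha):
    """Linear shrinkage of one eigenvalue toward the target 1."""
    return (1 - alpha) * lam + alpha


def shrink_matrix(a, b, c, alpha):
    """Entries of (1 - alpha) * Sigma_hat + alpha * I_2 for Sigma_hat = [[a, b], [b, c]]."""
    return (1 - alpha) * a + alpha, (1 - alpha) * b, (1 - alpha) * c + alpha


a, b, c, alpha = 2.2, 0.6, 0.4, 0.3   # hypothetical sample covariance and weight
before = sym2_eigenvalues(a, b, c)
after = sym2_eigenvalues(*shrink_matrix(a, b, c, alpha))
expected = tuple(shrink_eigenvalue(lam, alpha) for lam in before)
print(before)
print(after)
print(expected)  # should match `after` up to rounding
```

Both eigenvalues move toward $1$ by the factor $1-\alpha$, and their order is preserved because $1-\alpha>0$.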
In statistical terms, the Marchenko-Pastur law separates two effects that are often conflated. The population covariance determines the target spectrum, while high-dimensional sampling creates a deterministic spectral deformation around that target. Later spiked models exploit this distinction: a population eigenvalue must exceed a threshold before it separates from the Marchenko-Pastur bulk and becomes reliably detectable.
The Marchenko-Pastur theory describes how high-dimensional sampling distorts covariance spectra in a deterministic way. That spectral deformation is exactly what makes covariance estimation and PCA delicate, so the next chapter turns to non-asymptotic error bounds and eigenspace perturbation theory.
# 8. Covariance Estimation and PCA in High Dimensions
This chapter has three course goals. First, it develops non-asymptotic covariance concentration bounds in operator and Frobenius norms. Second, it shows how deterministic eigenspace perturbation theory turns covariance error into PCA error. Third, it explains why proportional-dimensional random matrix effects change the interpretation of sample eigenvalues and eigenvectors. The prerequisites are the minimax testing tools from Chapters 1 through 3, the sub-Gaussian concentration and spectral theory from Chapter 6, the Marchenko-Pastur edge from Chapter 7, and standard facts about Gaussian covariance models.
High-dimensional covariance estimation asks how accurately the second-order structure of a distribution can be learned when the ambient dimension $p$ is comparable to, or larger than, the sample size $n$. Earlier chapters developed minimax lower bounds and random matrix tools separately; here they meet in the analysis of the sample covariance matrix and principal component analysis. The central tension is that entrywise averages may be accurate while spectral quantities, such as eigenvalues and eigenspaces, still fluctuate at a scale controlled by the effective dimension of the distribution. The same perturbation viewpoint also links this chapter to numerical linear algebra, inverse problems, and functional analysis: in each setting, the stable recovery of a low-dimensional spectral object depends on a gap separating the object from unresolved noise.
We work throughout with i.i.d. random vectors $X_1,\dots,X_n \in \mathbb R^p$ satisfying $\mathbb E[X_i]=0$ and covariance matrix $\Sigma = \mathbb E[X_iX_i^\top]$. The sample covariance matrix is
\begin{align*}
\widehat\Sigma = \frac{1}{n}\sum_{i=1}^n X_iX_i^\top.
\end{align*}
The chapter studies three linked questions: how large $\|\widehat\Sigma-\Sigma\|_{\mathrm{op}}$ and $\|\widehat\Sigma-\Sigma\|_{F}$ are, how these errors perturb eigenspaces, and why classical PCA can fail in regimes where the aspect ratio satisfies
\begin{align*}
\liminf_{n\to\infty}\frac{p}{n}>0.
\end{align*}
## Spectral and Frobenius Losses for Covariance Estimation
The first issue is choosing a loss that matches the statistical task. Estimating every covariance entry is different from estimating the largest variance direction, so the Frobenius norm and operator norm lead to different rates and different notions of dimension.
[definition: Matrix Norms for Covariance Error]
The operator norm is the map
\begin{align*}
\|\cdot\|_{\mathrm{op}}: \mathbb R^{p\times p} \to \mathbb R_{\ge 0},
\qquad
A \mapsto \|A\|_{\mathrm{op}} = \sup_{|v|=1}|Av|.
\end{align*}
The Frobenius norm is the map
\begin{align*}
\|\cdot\|_{F}: \mathbb R^{p\times p} \to \mathbb R_{\ge 0},
\qquad
A \mapsto \|A\|_{F} = \left(\sum_{i=1}^p\sum_{j=1}^p A_{ij}^2\right)^{1/2}.
\end{align*}
[/definition]
For an estimator $\widehat\Sigma$ of $\Sigma$, the corresponding operator-norm and Frobenius-norm losses are $\|\widehat\Sigma-\Sigma\|_{\mathrm{op}}$ and $\|\widehat\Sigma-\Sigma\|_{F}$.
The operator norm measures the worst directional variance error, since $\|A\|_{\mathrm{op}}=\sup_{|v|=1}|v^\top Av|$ for symmetric $A$. The Frobenius norm aggregates errors over all entries or, equivalently, over all eigenvalue directions. Thus the squared Frobenius risk typically scales like $p^2/n$, because all $p^2$ entries contribute, while the operator-norm risk can be governed by an intrinsic rank rather than by $p$ alone.
A distribution with many tiny eigenvalues behaves spectrally like a lower-dimensional distribution. Ambient dimension alone is therefore too crude for operator-norm covariance bounds: two covariance matrices may have the same size $p$ but very different amounts of variance spread across their spectrum.
The quantity needed for the next covariance bounds should count directions in proportion to their variance contribution, while remaining invariant under rescaling of the whole covariance matrix. The following definition captures this by comparing total variance with the largest directional variance, so that weak directions do not count the same way as dominant ones.
[definition: Effective Rank]
The effective rank is the map
\begin{align*}
r_{\mathrm{eff}}: \mathbb S_+^p\setminus\{0\} \to \mathbb R_{>0},
\qquad
\Sigma \mapsto r_{\mathrm{eff}}(\Sigma) = \frac{\operatorname{tr}(\Sigma)}{\|\Sigma\|_{\mathrm{op}}},
\end{align*}
where $\mathbb S_+^p$ denotes the cone of positive semidefinite matrices in $\mathbb R^{p\times p}$.
[/definition]
The effective rank always lies between $1$ and $\operatorname{rank}(\Sigma)$. It is stable under rescaling of $\Sigma$, and it is small when the spectrum is concentrated near its top eigenvalue. The following example calibrates the definition on the spiked models that later drive the PCA phase transition.
[example: Effective Rank of a Spiked Covariance]
Let $\Sigma=\lambda vv^\top+\sigma^2 I_p$, where $|v|=1$, $\lambda>0$, and $\sigma^2>0$. Since $v^\top v=|v|^2=1$, the action of $\Sigma$ on the spike direction is
\begin{align*}
\Sigma v=\lambda vv^\top v+\sigma^2 I_pv=\lambda v(v^\top v)+\sigma^2 v=(\lambda+\sigma^2)v.
\end{align*}
Thus $v$ is an eigenvector with eigenvalue $\lambda+\sigma^2$. If $w\perp v$, then $v^\top w=0$, so
\begin{align*}
\Sigma w=\lambda vv^\top w+\sigma^2 I_pw=\lambda v(v^\top w)+\sigma^2 w=\sigma^2 w.
\end{align*}
Hence the eigenvalues are $\lambda+\sigma^2$ once and $\sigma^2$ on the $(p-1)$-dimensional orthogonal complement of $v$.
Because $\lambda+\sigma^2>\sigma^2$, the operator norm is
\begin{align*}
\|\Sigma\|_{\mathrm{op}}=\lambda+\sigma^2.
\end{align*}
The trace is the sum of the eigenvalues, so
\begin{align*}
\operatorname{tr}(\Sigma)=(\lambda+\sigma^2)+(p-1)\sigma^2=\lambda+p\sigma^2.
\end{align*}
By the definition of effective rank,
\begin{align*}
r_{\mathrm{eff}}(\Sigma)=\frac{\operatorname{tr}(\Sigma)}{\|\Sigma\|_{\mathrm{op}}}=\frac{\lambda+p\sigma^2}{\lambda+\sigma^2}.
\end{align*}
If $\lambda\gg p\sigma^2$, then $p\sigma^2/\lambda$ and $\sigma^2/\lambda$ are both small, and
\begin{align*}
r_{\mathrm{eff}}(\Sigma)=\frac{1+p\sigma^2/\lambda}{1+\sigma^2/\lambda}
\end{align*}
is close to $1$. If $\lambda$ is comparable to $\sigma^2$, then $\lambda+p\sigma^2$ is of order $p\sigma^2$ while $\lambda+\sigma^2$ is of order $\sigma^2$, so the effective rank is of order $p$.
[/example]
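The two regimes in the example can be checked numerically. The following is a minimal sketch in Python that uses only the closed-form eigenvalues derived above; the function name `effective_rank_spiked` and the parameter values are illustrative choices, not course notation.

```python
def effective_rank_spiked(lam, sigma2, p):
    """Effective rank tr(Sigma)/||Sigma||_op for Sigma = lam*v*v^T + sigma2*I_p.

    Uses the eigenvalues from the example: lam + sigma2 once and sigma2
    with multiplicity p - 1. Assumes lam > 0, sigma2 > 0, and p >= 1.
    """
    trace = lam + p * sigma2   # sum of all eigenvalues
    op_norm = lam + sigma2     # largest eigenvalue
    return trace / op_norm


if __name__ == "__main__":
    p = 1000
    # Dominant spike (lam >> p * sigma2): effective rank close to 1.
    print(effective_rank_spiked(lam=1e6, sigma2=1.0, p=p))
    # Spike comparable to the noise level: effective rank of order p.
    print(effective_rank_spiked(lam=1.0, sigma2=1.0, p=p))
```

With $\lambda=\sigma^2=1$ the value is $(1+p)/2$, which makes the order-$p$ claim concrete.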
This example shows why the same ambient dimension may lead to different spectral estimation behaviour. To turn that intuition into a probability bound, the course imposes a tail condition strong enough to control every one-dimensional projection uniformly.
For a real-valued random variable $Y$, the sub-Gaussian Orlicz norm used below is the functional
\begin{align*}
\|\cdot\|_{\psi_2}:L^0(\Omega,\mathcal F,\mathbb P;\mathbb R)\to[0,\infty],
\qquad
\|Y\|_{\psi_2}=\inf\left\{s>0:\mathbb E\left[\exp(Y^2/s^2)\right]\le 2\right\}.
\end{align*}
Finiteness of this norm is equivalent to Gaussian-type tail decay up to absolute constants.
[definition: Sub-Gaussian Random Vector with Covariance Proxy]
A mean-zero random vector $X \in \mathbb R^p$ with covariance matrix $\Sigma$ is $K$-sub-Gaussian with covariance proxy $\Sigma$ if for every $v \in \mathbb R^p$,
\begin{align*}
\|v^\top X\|_{\psi_2} \le K\,(v^\top \Sigma v)^{1/2}.
\end{align*}
[/definition]
This assumption says that every one-dimensional projection has Gaussian-type tails at the variance scale dictated by $\Sigma$. A fixed-direction tail bound does not yet control the covariance estimator, because the operator norm asks for the worst quadratic-form error over all unit vectors at once. The main difficulty is to make this uniform control depend on the intrinsic spectral size of $\Sigma$, rather than automatically paying the full ambient dimension.
[quotetheorem:5933]
[citeproof:5933]
The two terms in the bound correspond to the usual square-root sampling fluctuation and a higher-order term relevant when the effective rank is comparable to or larger than $n$. The independence assumption is what lets the empirical average concentrate at rate $n^{-1/2}$; if $X_1=\cdots=X_n=Z$ for a single mean-zero vector $Z$ with covariance $\Sigma$, then $\widehat\Sigma=ZZ^\top$ has the variability of one observation rather than $n$ observations. The sub-Gaussian hypothesis is also substantive: if $X=RZ$ with $Z\sim\mathcal N(0,I_p)$ and $R$ occasionally takes values of order $\sqrt n$ while preserving finite second moments, then one rare sample can dominate $\|\widehat\Sigma-\Sigma\|_{\mathrm{op}}$, and robust covariance estimators are then needed to recover comparable rates. Mean zero avoids mixing covariance estimation with mean estimation, while the factor $\|\Sigma\|_{\mathrm{op}}$ records the unavoidable scaling of the problem under $X_i\mapsto aX_i$. In the isotropic case $\Sigma=I_p$, the effective rank is $p$, so the next example records the benchmark high-dimensional scale against which later PCA failures are compared.
[example: Isotropic Gaussian Covariance Error]
Let $X_i\sim\mathcal N(0,I_p)$ be i.i.d. For $\Sigma=I_p$, all $p$ eigenvalues are equal to $1$, so
\begin{align*}
\operatorname{tr}(I_p)=\sum_{j=1}^p 1=p.
\end{align*}
Also $\|I_p\|_{\mathrm{op}}=1$, because $I_p v=v$ for every unit vector $v$. By the definition of effective rank,
\begin{align*}
r_{\mathrm{eff}}(I_p)=\frac{\operatorname{tr}(I_p)}{\|I_p\|_{\mathrm{op}}}=\frac{p}{1}=p.
\end{align*}
Applying *[Effective-Rank Sample Covariance Concentration Theorem](/theorems/5933)* with $\Sigma=I_p$ gives, for every $t\ge 1$, with probability at least $1-e^{-t}$,
\begin{align*}
\|\widehat\Sigma-I_p\|_{\mathrm{op}}\le C\|I_p\|_{\mathrm{op}}\left(\sqrt{\frac{r_{\mathrm{eff}}(I_p)+t}{n}}+\frac{r_{\mathrm{eff}}(I_p)+t}{n}\right).
\end{align*}
Substituting $\|I_p\|_{\mathrm{op}}=1$ and $r_{\mathrm{eff}}(I_p)=p$ gives
\begin{align*}
\|\widehat\Sigma-I_p\|_{\mathrm{op}}\le C\left(\sqrt{\frac{p+t}{n}}+\frac{p+t}{n}\right),
\end{align*}
where $C$ is an absolute constant in the standard Gaussian case, since the sub-Gaussian parameter of $\mathcal N(0,I_p)$ does not depend on $p$.
If $p/n\to\gamma\in(0,\infty)$ and $t$ is fixed, then
\begin{align*}
\frac{p+t}{n}=\frac{p}{n}+\frac{t}{n}\to\gamma+0=\gamma.
\end{align*}
Therefore the concentration scale satisfies
\begin{align*}
\sqrt{\frac{p+t}{n}}+\frac{p+t}{n}\to\sqrt{\gamma}+\gamma.
\end{align*}
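This limiting scale is positive, so the upper bound does not vanish in proportional dimensions. The failure is genuine rather than an artifact of the bound: the white-noise spectral edge result below shows that $\lambda_1(\widehat\Sigma)$ converges to $(1+\sqrt\gamma)^2>1$, so $\|\widehat\Sigma-I_p\|_{\mathrm{op}}\ge\lambda_1(\widehat\Sigma)-1$ stays bounded away from zero.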
[/example]
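The limit in this example can be tabulated directly. The sketch below evaluates the bracketed factor of the bound with the unspecified constant $C$ and the factor $\|\Sigma\|_{\mathrm{op}}$ left out; the function name `concentration_scale` and the grid of sample sizes are assumptions made for illustration.

```python
import math


def concentration_scale(r_eff, n, t):
    """Return sqrt((r_eff + t)/n) + (r_eff + t)/n.

    This is the bound without the factor C * ||Sigma||_op.
    """
    ratio = (r_eff + t) / n
    return math.sqrt(ratio) + ratio


if __name__ == "__main__":
    gamma, t = 0.5, 1.0
    limit = math.sqrt(gamma) + gamma
    for n in (100, 1_000, 10_000, 100_000):
        p = int(gamma * n)  # isotropic case: r_eff(I_p) = p
        print(n, p, concentration_scale(p, n, t), limit)
```

The printed scale approaches $\sqrt\gamma+\gamma$ as $n$ grows with $p/n$ held at $\gamma$, instead of decaying to zero.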
The Frobenius norm asks for a different comparison because it sums errors over many directions. Operator-norm concentration can be small even when the aggregate squared error over all covariance entries is large. To see the cost of estimating every pairwise covariance without hiding it behind worst-direction notation, one needs an exact Frobenius-risk calculation in a model where the fourth moments are explicit.
[quotetheorem:5934]
[citeproof:5934]
For $\Sigma=I_p$, the Frobenius risk is $(p^2+p)/n$, so Frobenius consistency in mean squared loss requires
\begin{align*}
\frac{p^2}{n}\to 0.
\end{align*}
The Gaussian assumption is used through the exact fourth-moment identity; for example, if the coordinates are independent Rademacher variables with covariance $I_p$, then $|X|^2=p$ deterministically and $\mathbb E[|X|^4]=p^2$, not $p^2+2p$, so the displayed Gaussian equality changes. Independence is again essential because cross terms in the squared Frobenius norm vanish only after separating different observations; in the extreme dependent model $X_1=\cdots=X_n=Z$, averaging does not reduce the Frobenius risk by a factor of $n$. The requirement $p^2/n\to0$ is stronger than the condition $p/n\to0$ for operator-norm consistency and reflects the cost of estimating all pairwise covariances, which is why later minimax bounds distinguish the two losses.
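The Rademacher contrast can be verified exactly for small $p$ by enumeration. The sketch below assumes the estimator $\widehat\Sigma=n^{-1}\sum_i X_iX_i^\top$ and uses the single-observation quantity $\mathbb E\|XX^\top-I_p\|_F^2=\mathbb E[|X|^4]-p$, which follows from expanding the square with $\mathbb E[XX^\top]=I_p$; both function names are hypothetical.

```python
from itertools import product


def rademacher_single_obs_risk(p):
    """Exact E||X X^T - I_p||_F^2 for X uniform on {-1, +1}^p, by enumeration."""
    total = 0
    for x in product((-1, 1), repeat=p):
        for i in range(p):
            for j in range(p):
                entry = x[i] * x[j] - (1 if i == j else 0)
                total += entry * entry
    return total / 2 ** p


def gaussian_single_obs_risk(p):
    """E||X X^T - I_p||_F^2 for X ~ N(0, I_p), using E|X|^4 = p^2 + 2p."""
    return p * p + p


if __name__ == "__main__":
    for p in (2, 3, 4, 5):
        # For i.i.d. observations, dividing either value by n gives the
        # squared Frobenius risk of the n-sample average.
        print(p, rademacher_single_obs_risk(p), gaussian_single_obs_risk(p))
```

The enumeration returns $p^2-p$, compared with $p^2+p$ in the Gaussian case: the leading $p^2$ term is the same, and only the lower-order term depends on the fourth moments.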
## Eigenspace Perturbation and Principal Subspaces
PCA does not require estimating the full covariance matrix. It requires estimating the eigenspaces associated with large eigenvalues, so the next question is how spectral error transfers into subspace error.
[definition: Spectral Projector and Principal Subspace]
Let $\Sigma \in \mathbb R^{p\times p}$ be symmetric with eigenvalues $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_p$. For $1 \le r < p$, let $V_r \subset \mathbb R^p$ be the span of the eigenvectors associated with $\lambda_1,\dots,\lambda_r$. The orthogonal projector onto $V_r$ is denoted $P_r$.
\begin{align*}
P_r:\mathbb R^p\to\mathbb R^p.
\end{align*}
The subspace $V_r$ is called the top-$r$ principal subspace of $\Sigma$.
[/definition]
The projector is the right object because a subspace has no preferred basis. If $U\in \mathbb R^{p\times r}$ has orthonormal columns spanning $V_r$, then $P_r=UU^\top$, and changing the basis inside $V_r$ leaves $P_r$ unchanged. To make the target statistically identifiable under perturbation, the next definition records the spectral separation around the chosen cluster.
[definition: Eigengap]
On the class of symmetric matrices $\Sigma\in\mathbb R^{p\times p}$ with ordered eigenvalues $\lambda_1(\Sigma)\ge\cdots\ge\lambda_p(\Sigma)$, the top-$r$ eigengap is the map
\begin{align*}
\Delta_r:\{\Sigma\in\mathbb R^{p\times p}:\Sigma=\Sigma^\top\}\to\mathbb R,
\qquad
\Sigma \mapsto \Delta_r(\Sigma) = \lambda_r(\Sigma)-\lambda_{r+1}(\Sigma),
\end{align*}
for $1\le r<p$.
[/definition]
A positive eigengap separates the target subspace from its orthogonal complement. Even a small matrix perturbation can rotate eigenvectors sharply when eigenvalues are tied or nearly tied, so spectral error by itself is not yet a PCA error bound.
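A two-dimensional computation makes the sensitivity to the gap concrete. The sketch below is an illustration under stated assumptions rather than part of the course's argument: it takes the matrix $\operatorname{diag}(1+g,1)$, adds a symmetric off-diagonal perturbation of operator norm $\varepsilon$, and returns the angle by which the top eigenvector rotates away from $e_1$. The function name `top_eigvec_angle` is hypothetical.

```python
import math


def top_eigvec_angle(gap, eps):
    """Angle in radians between e_1 and the top eigenvector of
    [[1 + gap, eps], [eps, 1]].

    For a symmetric 2x2 matrix [[a, b], [b, d]] the top eigenvector is
    (cos phi, sin phi) with phi = 0.5 * atan2(2b, a - d).
    Assumes gap >= 0 and eps > 0.
    """
    return 0.5 * math.atan2(2.0 * eps, gap)


if __name__ == "__main__":
    eps = 1e-3
    for gap in (1.0, 1e-2, 1e-3, 0.0):
        print(gap, math.degrees(top_eigvec_angle(gap, eps)))
```

With a gap much larger than $\varepsilon$ the rotation is of order $\varepsilon/g$, while for a zero gap it is 45 degrees no matter how small $\varepsilon$ is.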
The remaining deterministic problem is to convert a bound on the perturbation matrix into a bound on the angle between two invariant subspaces. The obstruction is that operator-norm error controls eigenvalues directly but controls eigenvectors only through the separating gap; without such a conversion, a covariance concentration estimate would say little about the principal components themselves.
[quotetheorem:5935]
[citeproof:5935]
This theorem turns deterministic matrix perturbation into a statistical tool once $E=\widehat\Sigma-\Sigma$. The eigengap condition is necessary: for instance, if $A=I_2$ and the target rank is declared to be $1$, every one-dimensional subspace is a top eigenspace. The perturbation $E_\varepsilon=\varepsilon\operatorname{diag}(1,0)$ selects $\operatorname{span}(e_1)$, while $F_\varepsilon=\varepsilon\operatorname{diag}(0,1)$ selects $\operatorname{span}(e_2)$, and both perturbations have operator norm $\varepsilon$ although the selected subspaces are orthogonal. The theorem is perturbative rather than probabilistic, so it does not explain how large $\|E\|_{\mathrm{op}}$ is or whether the perturbation event is likely. Its strength is modularity: any estimator or model that supplies an operator-norm covariance bound can be inserted into the same subspace perturbation inequality. Its limitation is also modularity: it treats the perturbation through a worst-direction norm, so it may ignore cancellations and variance structure specific to the target eigenspace. The next result supplies the statistical input from sample covariance concentration and states the resulting PCA rate directly in the notation of covariance estimation.
The preceding perturbation theorem becomes a PCA guarantee only after specifying the population and empirical projectors and checking that the perturbation is small enough for the empirical rank-$r$ cluster to match the population cluster. The next theorem packages those two inputs: sample covariance concentration supplies the perturbation size, while Davis-Kahan supplies the conversion from matrix error to subspace error.
[quotetheorem:5936]
[citeproof:5936]
The bound identifies two levers: sampling noise must be small, and the eigengap must be large enough to absorb that noise. The smallness condition is not cosmetic; without it, empirical eigenvalues can cross the population gap, and the phrase "sample top-$r$ subspace" may refer to a different spectral cluster. The result also inherits the crudeness of an operator-norm route: it is uniform and robust, but it may miss sharper subspace rates that exploit the particular eigenstructure or effective noise variance around the target eigenspace. The next example places these levers in the standard low-rank signal plus isotropic noise model used throughout PCA.
[example: Estimating a Rank r Principal Subspace]
Assume $\Sigma=\lambda UU^\top+\sigma^2 I_p$, where $U\in \mathbb R^{p\times r}$ has orthonormal columns, $\lambda>0$, and $\sigma^2>0$. Since $U^\top U=I_r$, every vector $x\in\operatorname{Range}(U)$ can be written as $x=Ua$, and then
\begin{align*}
\Sigma x=\lambda UU^\top Ua+\sigma^2 I_pUa=\lambda U(U^\top U)a+\sigma^2 Ua=\lambda Ua+\sigma^2 Ua=(\lambda+\sigma^2)x.
\end{align*}
If $y\perp\operatorname{Range}(U)$, then $U^\top y=0$, so
\begin{align*}
\Sigma y=\lambda UU^\top y+\sigma^2 I_py=\lambda U0+\sigma^2 y=\sigma^2 y.
\end{align*}
Thus the top $r$ eigenspace is $\operatorname{Range}(U)$, its projector is $UU^\top$, and the top-$r$ eigengap is
\begin{align*}
\Delta_r=(\lambda+\sigma^2)-\sigma^2=\lambda.
\end{align*}
Because $\lambda+\sigma^2>\sigma^2$, the operator norm is
\begin{align*}
\|\Sigma\|_{\mathrm{op}}=\lambda+\sigma^2.
\end{align*}
The trace is the sum of the eigenvalues, so
\begin{align*}
\operatorname{tr}(\Sigma)=r(\lambda+\sigma^2)+(p-r)\sigma^2=r\lambda+r\sigma^2+p\sigma^2-r\sigma^2=r\lambda+p\sigma^2.
\end{align*}
By the definition of effective rank,
\begin{align*}
r_{\mathrm{eff}}(\Sigma)=\frac{\operatorname{tr}(\Sigma)}{\|\Sigma\|_{\mathrm{op}}}=\frac{r\lambda+p\sigma^2}{\lambda+\sigma^2}.
\end{align*}
Applying *Principal Subspace Error Bound* with $\Delta_r=\lambda$, $\|\Sigma\|_{\mathrm{op}}=\lambda+\sigma^2$, and $P_r=UU^\top$ gives, with the stated high probability and under its perturbation smallness condition,
\begin{align*}
\|\widehat P_r-UU^\top\|_F\le \frac{C_K\sqrt r\,(\lambda+\sigma^2)}{\lambda}\left(\sqrt{\frac{r_{\mathrm{eff}}(\Sigma)+t}{n}}+\frac{r_{\mathrm{eff}}(\Sigma)+t}{n}\right).
\end{align*}
Substituting the effective rank computed above gives
\begin{align*}
\|\widehat P_r-UU^\top\|_F\le \frac{C_K\sqrt r\,(\lambda+\sigma^2)}{\lambda}\left(\sqrt{\frac{(r\lambda+p\sigma^2)/(\lambda+\sigma^2)+t}{n}}+\frac{(r\lambda+p\sigma^2)/(\lambda+\sigma^2)+t}{n}\right).
\end{align*}
The factor $(\lambda+\sigma^2)/\lambda=1+\sigma^2/\lambda$ shows explicitly that increasing the spike-to-noise ratio $\lambda/\sigma^2$ reduces the eigengap penalty in the subspace error bound.
[/example]
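To see how the two levers interact, the right-hand side of the final display can be evaluated for concrete parameters. This is a sketch only: the constant $C_K$ is not specified in the text, so the parameter `c_k` below is set to $1$ as a placeholder, and the function name `spiked_subspace_bound` is hypothetical. The value is meaningful only when the smallness condition of the theorem holds.

```python
import math


def spiked_subspace_bound(lam, sigma2, p, r, n, t, c_k=1.0):
    """Right-hand side of the displayed Frobenius bound for
    Sigma = lam * U U^T + sigma2 * I_p with a rank-r spike.

    c_k stands in for the unspecified constant C_K (assumed 1.0 here).
    """
    r_eff = (r * lam + p * sigma2) / (lam + sigma2)  # effective rank
    ratio = (r_eff + t) / n
    gap_penalty = (lam + sigma2) / lam               # ||Sigma||_op / eigengap
    return c_k * math.sqrt(r) * gap_penalty * (math.sqrt(ratio) + ratio)


if __name__ == "__main__":
    p, r, n, t = 200, 2, 5000, 3.0
    for lam in (0.5, 1.0, 5.0, 50.0):
        print(lam, spiked_subspace_bound(lam, 1.0, p, r, n, t))
```

Increasing $\lambda$ at fixed $\sigma^2$ lowers both the gap penalty $1+\sigma^2/\lambda$ and the effective rank, so the printed values decrease.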
This perturbation view is local: it assumes a population spectral gap and asks whether sampling noise crosses it. The next section explains why, in proportional dimensions, even a large sample can leave the sample eigenstructure systematically distorted.
## Classical PCA When Dimension Is Comparable to Sample Size
Classical PCA is often justified by laws of large numbers for fixed $p$ and $n\to\infty$. In high dimension the relevant asymptotic regime may instead satisfy
\begin{align*}
\frac{p}{n}\to \gamma>0,
\end{align*}
and the sample spectrum no longer concentrates around the population spectrum in the same way.
[definition: Proportional Asymptotic Regime]
A sequence of covariance estimation problems is in the proportional asymptotic regime if $p=p_n$ and
\begin{align*}
\frac{p_n}{n} \to \gamma \in (0,\infty)
\end{align*}
as $n\to\infty$.
[/definition]
In this regime the noise eigenvalues of $\widehat\Sigma$ spread over a non-degenerate interval. This creates a basic obstruction for PCA: even when the population covariance is exactly $I_p$, the largest empirical eigenvalue is pushed above the population variance level by high-dimensional noise. To decide whether an observed leading component is signal or fluctuation, the first benchmark is the almost-sure location of the white-noise spectral edge.
[quotetheorem:5937]
The result diagnoses a basic failure mode of PCA: when $\Sigma=I_p$, there is no preferred population direction, but the sample covariance still produces a largest empirical component with an inflated eigenvalue. Each hypothesis fixes part of the benchmark. Gaussianity gives the classical Wishart model; with heavy-tailed isotropic data, a rare observation with unusually large Euclidean norm can create a top eigenvalue above the Marchenko-Pastur edge, while some non-Gaussian light-tailed models require additional moment or universality assumptions before the same edge can be used. The white-noise covariance is also essential: if $\Sigma$ has even one sufficiently strong spike, the top sample eigenvalue may separate from the noise bulk, and if $\Sigma$ has a non-flat spectrum, the limiting edge is governed by the deformed population spectrum rather than by $(1+\sqrt\gamma)^2$. The proportional regime is the source of the non-degenerate edge; when $p/n\to0$, the largest eigenvalue returns to $1$ under standard conditions, while if $p/n\to\infty$ the stated finite edge is no longer the right scale. The theorem controls the top eigenvalue, not the interpretability of the corresponding eigenvector, and this limitation is exactly what the next example illustrates.
[example: Spurious Principal Components in White Noise]
Let $X_i\sim \mathcal N(0,I_p)$ with $p/n\to\gamma>0$, and let $u$ be any unit vector in $\mathbb R^p$. The population variance in direction $u$ is
\begin{align*}
\operatorname{Var}(u^\top X_i)
= u^\top I_p u
= u^\top u
= |u|^2
=1.
\end{align*}
Thus every unit direction has the same population variance, so population PCA has no distinguished leading direction.
By *[Marchenko Pastur Edge for White Noise](/theorems/5937)*,
\begin{align*}
\lambda_1(\widehat\Sigma)\xrightarrow{a.s.}(1+\sqrt\gamma)^2.
\end{align*}
Since $\gamma>0$,
\begin{align*}
(1+\sqrt\gamma)^2
=1+2\sqrt\gamma+\gamma
>1.
\end{align*}
Therefore the largest sample eigenvalue eventually lies above the population variance level $1$, even though the population covariance is exactly $I_p$ and has no signal direction. The corresponding top empirical eigenvector is selected by random fluctuation, so retaining components only because their sample eigenvalues exceed $1$ creates false structure in proportional dimensions.
[/example]
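A small simulation shows the inflation at finite size. The sketch below uses only the Python standard library and approximates the top eigenvalue of $n^{-1}\sum_i X_iX_i^\top$ by power iteration; the sizes $p=40$, $n=80$, the seed, and the iteration count are arbitrary illustrative choices, and the finite-sample value fluctuates around the limiting edge instead of equalling it.

```python
import math
import random


def mp_upper_edge(gamma):
    """Upper Marchenko-Pastur edge (1 + sqrt(gamma))^2 for white noise."""
    return (1.0 + math.sqrt(gamma)) ** 2


def top_sample_eigenvalue(p, n, seed=0, iters=200):
    """Approximate lambda_1 of (1/n) sum_i X_i X_i^T for X_i ~ N(0, I_p).

    Power iteration on S = X^T X / n without forming S. The returned
    Rayleigh quotient is a lower bound on the true top eigenvalue.
    """
    rng = random.Random(seed)
    X = [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]
    u = [rng.gauss(0.0, 1.0) for _ in range(p)]
    norm = math.sqrt(sum(c * c for c in u))
    u = [c / norm for c in u]
    rayleigh = 0.0
    for _ in range(iters):
        Xu = [sum(row[j] * u[j] for j in range(p)) for row in X]
        Su = [sum(X[i][j] * Xu[i] for i in range(n)) / n for j in range(p)]
        rayleigh = sum(a * b for a, b in zip(u, Su))  # u^T S u with |u| = 1
        norm = math.sqrt(sum(c * c for c in Su))
        u = [c / norm for c in Su]
    return rayleigh


if __name__ == "__main__":
    p, n = 40, 80  # aspect ratio gamma = 0.5
    print("population variance level:", 1.0)
    print("approximate top sample eigenvalue:", top_sample_eigenvalue(p, n))
    print("limiting edge:", mp_upper_edge(p / n))
```

Even at this modest size the top sample eigenvalue sits far above the population level $1$, although every direction has the same population variance.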
The white-noise example shows that a large empirical eigenvalue is not by itself evidence of signal. The remaining question is how to add one genuine population direction without losing the clean noise benchmark. A rank-one spike is the minimal model that creates a single preferred direction and a single eigengap, so any failure of PCA can be attributed to the competition between that spike and the proportional-dimensional noise edge.
[definition: Rank One Spiked Covariance Model]
The rank one spiked covariance model is the Gaussian model
\begin{align*}
X_i \sim \mathcal N(0,\Sigma),
\qquad
\Sigma = (1+\theta)vv^\top + (I_p-vv^\top),
\end{align*}
where $|v|=1$ and $\theta>0$.
[/definition]
The population leading eigenvalue is $1+\theta$, with eigengap $\theta$. In fixed dimension any positive $\theta$ is eventually detectable, but proportional asymptotics impose a positive threshold because the noise edge can hide a weak spike. The central obstruction is eigenvector alignment: below the BBP threshold $\theta=\sqrt\gamma$, the top empirical eigenvalue sits well above the population level $1$ because of the noise edge, while the corresponding direction still carries little information about $v$.
[quotetheorem:5938]
For the statistical interpretation here, the theorem states that a population eigengap alone does not guarantee useful PCA when the gap is below the ambient noise scale. Each modelling condition is tied to that conclusion. The proportional-dimension assumption is essential to this failure: in fixed dimension, a positive spike is eventually separated by ordinary consistency of $\widehat\Sigma$. Gaussianity gives the exact BBP threshold in this form; for elliptical data, heavy-tailed coordinates, or dependent coordinates, the empirical eigenvector may have a different limiting alignment or may be governed by extreme observations rather than by the spike. Fixed spike strength matters as well: if $\theta=\theta_n$ grows with $n$, the spike can dominate the noise edge and recover consistency, while if $\theta_n\to0$ the model approaches white noise and no leading direction remains detectable. The rank-one structure isolates a single signal direction; with several spikes, repeated spike eigenvalues, or a structured sparse spike, the relevant object can be a multi-dimensional subspace or a sparsity-constrained direction, so the scalar threshold and the statement about one top eigenvector are no longer sufficient. The statement also does not give a finite-sample guarantee or a universal threshold for non-Gaussian noise; it identifies the asymptotic benchmark that motivates thresholding and shrinkage methods.
[example: Sample Eigenvector Inconsistency When the Spike Is Too Weak]
Let $p/n\to 1$ and consider
\begin{align*}
\Sigma=(1+\theta)vv^\top+(I_p-vv^\top),
\end{align*}
where $|v|=1$ and $0<\theta<1$. On the spike direction,
\begin{align*}
\Sigma v=(1+\theta)vv^\top v+(I_p-vv^\top)v.
\end{align*}
Since $v^\top v=|v|^2=1$,
\begin{align*}
\Sigma v=(1+\theta)v+v-v=(1+\theta)v.
\end{align*}
If $w\perp v$, then $v^\top w=0$, and therefore
\begin{align*}
\Sigma w=(1+\theta)vv^\top w+(I_p-vv^\top)w.
\end{align*}
Using $v^\top w=0$ gives
\begin{align*}
\Sigma w=0+w-0=w.
\end{align*}
Thus the population eigenvalue in direction $v$ is $1+\theta$, while every direction orthogonal to $v$ has eigenvalue $1$. Since $\theta>0$,
\begin{align*}
1+\theta>1,
\end{align*}
so the population covariance has a unique leading eigendirection, namely $\operatorname{span}(v)$.
Here the proportional limit is
\begin{align*}
\gamma=\lim_{n\to\infty}\frac{p}{n}=1,
\end{align*}
so the spiked-PCA threshold is
\begin{align*}
\sqrt{\gamma}=\sqrt{1}=1.
\end{align*}
Because $0<\theta<1$, we have $\theta<\sqrt{\gamma}$. By *[Eigenvector Inconsistency Below the Spiked PCA Threshold](/theorems/5938)*,
\begin{align*}
|\hat v^\top v|^2\xrightarrow{a.s.}0.
\end{align*}
Since $|\hat v^\top v|$ is the absolute cosine of the angle between $\hat v$ and $v$, the empirical first principal component becomes asymptotically orthogonal to the true signal direction even though the population covariance has a unique leading eigenvector.
[/example]
The inconsistency example explains the failure mode of classical PCA: the empirical eigenvector is not merely noisy around the truth; it may point in a direction carrying no asymptotic signal. This motivates either stronger structural assumptions, such as sparsity, or procedures that account for the random matrix noise edge.
## Minimax Limits for Covariance Estimation
Concentration inequalities give upper bounds for the sample covariance. To know whether these rates are artifacts of a particular estimator or proof technique, or instead fundamental limits, the course compares them with minimax lower bounds over covariance classes.
[definition: Bounded Spectrum Covariance Class]
For constants $0<m<M<\infty$, define
\begin{align*}
\mathcal C_p(m,M)=\{\Sigma\in\mathbb R^{p\times p}: \Sigma=\Sigma^\top,\; mI_p\preceq \Sigma\preceq MI_p\}.
\end{align*}
[/definition]
In the minimax results below, the observation model is $X_1,\dots,X_n\sim\mathcal N(0,\Sigma)$ i.i.d. with unknown $\Sigma\in\mathcal C_p(m,M)$.
This class rules out ill-conditioning, so lower bounds over it cannot be blamed on nearly singular covariances. The next theorem is needed to show that the spectral rate from sample covariance concentration is forced by information theory, not by the analysis of a particular estimator.
[quotetheorem:5939]
[citeproof:5939]
For $\Sigma=I_p$, the concentration upper bound gives operator error of order $\sqrt{p/n}+p/n$. When $p\lesssim n$, the lower bound matches the leading term $\sqrt{p/n}$, so the sample covariance is rate-optimal in operator norm over bounded-spectrum Gaussian classes in the moderate-dimensional range. The bounded-spectrum hypothesis matters because it prevents lower bounds from being driven by scale or ill-conditioning rather than by high-dimensional covariance structure. Outside the moderate-dimensional range the risk saturates at a constant scale, so consistency is no longer the right conclusion even though the minimax lower bound remains informative. The next theorem is needed because Frobenius loss asks a broader estimation question and has a larger unavoidable dimension factor.
[quotetheorem:5940]
[citeproof:5940]
This Frobenius lower bound relies on allowing many covariance entries to vary at once; if the covariance matrix were known to be diagonal, sparse, or otherwise structured, the $p^2/n$ scale would no longer be the correct benchmark. The bounded-spectrum condition again keeps the comparison focused on dimension rather than on degeneracy, and the cap at order $p$ records the finite diameter of the class under squared Frobenius loss. The theorem is a rate lower bound, not a sharp finite-sample constant calculation, and by itself it does not identify a particular estimator attaining the bound over every range of $(p,n,m,M)$.
These lower bounds complete the comparison between norms. Operator loss is tied to spectral fluctuation and effective rank, while Frobenius loss is tied to aggregate entrywise uncertainty. PCA lies between them: it depends on operator perturbation, but its success also requires eigengaps and, in proportional dimensions, random matrix separation from the noise spectrum.
Covariance estimation and PCA showed how operator norm error, Frobenius error, and eigenspace recovery lead to different statistical thresholds. The next chapter focuses on spiked covariance models, where a low-rank signal must emerge from the random matrix bulk before it can be reliably detected.
# 9. Spiked Covariance Models and BBP Transitions
This chapter studies covariance matrices whose population spectrum is almost flat but contains a small number of preferred directions. Chapters 7 and 8 developed the global spectral laws for sample covariance matrices and their consequences for PCA, with the Marchenko-Pastur distribution describing the empirical eigenvalue bulk. Spiked covariance models ask what happens when a fixed-rank signal is added to that null model: when does an empirical eigenvalue leave the bulk, when does the corresponding eigenvector reveal the signal direction, and when is the spike statistically detectable at all?
The central phenomenon is a sharp transition. Below the BBP threshold, the sample spectrum behaves as if there were no spike at the edge scale relevant to PCA, and the leading eigenvector has asymptotically vanishing alignment with the signal. Above the threshold, an outlier eigenvalue appears at a deterministic location and the associated eigenvector has a nonzero limiting squared overlap with the population spike direction.
## Rank-One and Finite-Rank Spiked Covariance Models
The basic question is how to formalise a low-dimensional signal hidden inside high-dimensional isotropic noise. The model should be simple enough to admit exact asymptotic formulas, but rich enough to represent the covariance structure exploited by principal component analysis. This motivates starting with a single unknown direction and a single strength parameter.
[definition: Rank-One Spiked Covariance Model]
Let $p,n \in \mathbb N$, let $v \in \mathbb R^p$ satisfy $|v|=1$, and let $\theta \ge 0$. The rank-one spiked covariance model is the Gaussian observation model
\begin{align*}
X_1,\dots,X_n \overset{\mathrm{i.i.d.}}{\sim} \mathcal N(0,\Sigma).
\end{align*}
The covariance matrix is
\begin{align*}
\Sigma = I_p + \theta vv^\top .
\end{align*}
The sample covariance matrix is
\begin{align*}
\hat{\Sigma} = \frac{1}{n}\sum_{i=1}^n X_iX_i^\top .
\end{align*}
[/definition]
The parameter $v$ is the unknown signal direction, while $\theta$ is the spike strength. The population covariance has eigenvalue $1+\theta$ in direction $v$ and eigenvalue $1$ on $v^\perp$, so the model isolates the effect of one distinguished principal component. The following computation fixes the population geometry before sampling noise is introduced.
[example: Population Spectrum of a Rank-One Spike]
Let $\Sigma=I_p+\theta vv^\top$ with $|v|=1$. Every vector $x\in\mathbb R^p$ can be decomposed as $x=av+w$ with $w\perp v$, and then
\begin{align*}
\Sigma x=(I_p+\theta vv^\top)(av+w).
\end{align*}
Expanding the two terms gives
\begin{align*}
(I_p+\theta vv^\top)(av+w)=av+w+\theta vv^\top(av+w).
\end{align*}
Since $v^\top v=1$ and $v^\top w=0$, the scalar factor in the rank-one term is
\begin{align*}
v^\top(av+w)=a v^\top v+v^\top w=a.
\end{align*}
Therefore
\begin{align*}
\Sigma x=av+w+\theta av=(1+\theta)av+w.
\end{align*}
Taking $a=1$ and $w=0$ gives $\Sigma v=(1+\theta)v$, so $v$ is an eigenvector with eigenvalue $1+\theta$. Taking $a=0$ gives $\Sigma w=w$ for every nonzero $w\in v^\perp$, so every nonzero vector in $v^\perp$ is an eigenvector with eigenvalue $1$. Thus the population eigengap between the spike direction and the orthogonal directions is
\begin{align*}
(1+\theta)-1=\theta.
\end{align*}
This separates the population PCA problem, where any $\theta>0$ creates an eigengap, from the high-dimensional sample PCA problem, where sampling noise creates a competing spectral edge.
[/example]
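The decomposition can be checked on a concrete vector. The following sketch applies $\Sigma=I_p+\theta vv^\top$ without forming the matrix; the function name `apply_rank_one_spike` and the numerical vectors are illustrative assumptions.

```python
def apply_rank_one_spike(theta, v, x):
    """Return Sigma x for Sigma = I_p + theta * v v^T, with |v| = 1 assumed."""
    a = sum(vi * xi for vi, xi in zip(v, x))  # a = v^T x
    return [xi + theta * a * vi for vi, xi in zip(v, x)]


if __name__ == "__main__":
    theta = 0.5
    v = [0.6, 0.8, 0.0]    # unit vector
    w = [0.8, -0.6, 2.0]   # orthogonal to v
    print(apply_rank_one_spike(theta, v, v))  # (1 + theta) * v up to rounding
    print(apply_rank_one_spike(theta, v, w))  # w itself
```

Only the component of $x$ along $v$ is rescaled, by the factor $1+\theta$; the component in $v^\perp$ is returned unchanged.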
The rank-one computation shows the clean signal-noise decomposition, but many data sets contain several principal directions. To ask which directions survive high-dimensional sampling noise, the next model keeps finitely many spikes while allowing $p$ and $n$ to diverge.
[definition: Finite-Rank Spiked Covariance Model]
Let $r\in \mathbb N$ be fixed, let $v_1,\dots,v_r\in \mathbb R^p$ be orthonormal, and let $\theta_1\ge \cdots \ge \theta_r>0$. The finite-rank spiked covariance model is the Gaussian observation model
\begin{align*}
X_1,\dots,X_n \overset{\mathrm{i.i.d.}}{\sim} \mathcal N(0,\Sigma).
\end{align*}
The covariance matrix is
\begin{align*}
\Sigma = I_p + \sum_{j=1}^r \theta_j v_jv_j^\top .
\end{align*}
[/definition]
Each nonzero spike changes only one population eigenvalue, so the perturbation has fixed rank even when $p$ is large. To compare the spike sizes with the random spectral bulk, we need the high-dimensional scaling in which the null spectrum has a non-degenerate Marchenko-Pastur limit.
[definition: Proportional Asymptotic Regime]
A sequence of spiked covariance models is in the proportional asymptotic regime with aspect ratio $\gamma\in(0,\infty)$ if
\begin{align*}
p=p_n, \qquad \frac{p_n}{n}\to \gamma,
\end{align*}
as $n\to\infty$, while the number of spikes $r$ and the spike strengths $\theta_1,\dots,\theta_r$ remain fixed.
[/definition]
In the null case $\Sigma=I_p$, the largest sample eigenvalue concentrates near $(1+\sqrt{\gamma})^2$ when the aspect ratio tends to $\gamma$. The rest of the chapter compares each spike to this edge and asks whether the population eigenvalue $1+\theta$ can create a sample eigenvalue outside the noise bulk.
## Eigenvalue Separation and the BBP Transition
The first problem is spectral: given a spike strength $\theta$, where does the largest eigenvalue of $\hat{\Sigma}$ go? Population PCA would suggest that stronger spikes always stand apart, but high-dimensional noise has a top edge that absorbs weak signals. We therefore begin by recording the null edge against which spike separation is measured.
[quotetheorem:5941]
[citeproof:5941]
This theorem sets the benchmark for spike separation: an empirical eigenvalue is informative for PCA only if it exits the bulk above $(1+\sqrt{\gamma})^2$ by an order-one amount. The proportional hypothesis is essential because the edge depends on the limiting aspect ratio; if $p$ is fixed while $n\to\infty$, the null sample covariance instead converges to $I_p$ and the top eigenvalue tends to $1$. The Gaussian assumption is mainly a clean route to the Bai-Yin input, and many non-Gaussian models have the same edge under moment assumptions, but this statement does not describe Tracy-Widom edge fluctuations or finite-sample error probabilities. With this limitation understood, the next calculation asks exactly when a rank-one population spike creates an order-one eigenvalue beyond the null edge.
The BBP transition theorem uses population-eigenvalue notation for the spike, while the surrounding statistical model writes the covariance as $\Sigma=I_p+\theta vv^\top$ and therefore uses $\theta$ for spike strength. Translating between the two, the theorem's spike eigenvalue is $1+\theta$, so its condition that the population eigenvalue exceed $1+\sqrt{\gamma}$ is exactly the spike-strength condition $\theta>\sqrt{\gamma}$. Above this threshold, the sample outlier location becomes
\begin{align*}
(1+\theta)\left(1+\frac{\gamma}{\theta}\right)
\end{align*}
in the spike-strength notation used in this chapter. At and below the threshold, the leading sample eigenvalue remains at the upper Marchenko-Pastur edge $(1+\sqrt{\gamma})^2$.
In this spike-strength notation, the transition at $\theta=\sqrt{\gamma}$ is the Baik-Ben Arous-Péché, or BBP, transition. The fixed-spike assumption matters: the theorem compares a constant signal strength with a noise edge that remains order one in the proportional limit, while spikes drifting toward the critical value require a finer edge-scale analysis. Two contrasts show the role of the hypotheses. If $p$ is fixed and $n\to\infty$, then $\hat{\Sigma}\to\Sigma$ entrywise and the top sample eigenvalue tends to the population eigenvalue $1+\theta$, so there is no BBP threshold. If $\theta_n=\sqrt{\gamma}+a n^{-1/3}$, the outlier-edge separation is on the same scale as edge fluctuations, so the order-one limit in the theorem no longer decides finite-sample detection. The formula also shows the bias of sample eigenvalues: above the transition, the empirical outlier exceeds the population eigenvalue $1+\theta$ by a high-dimensional correction. Thus the result justifies a largest-eigenvalue test only for fixed supercritical alternatives separated from the threshold, and it motivates checking exactly what such a test can certify.
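The limiting formulas above can be tabulated directly. The following is a minimal sketch that evaluates only the limits, not finite-sample eigenvalues; the function names `null_edge` and `top_eigenvalue_limit` and the numerical values of $\gamma$, $\theta$, and $a$ are illustrative assumptions. The second loop takes $\theta_n=\sqrt{\gamma}+a n^{-1/3}$ and rescales the outlier-edge gap by $n^{2/3}$, which makes the order of the critical window visible.

```python
import math

def null_edge(gamma):
    """Upper Marchenko-Pastur edge (1 + sqrt(gamma))^2."""
    return (1.0 + math.sqrt(gamma)) ** 2

def top_eigenvalue_limit(theta, gamma):
    """Limit of the largest sample eigenvalue for one spike of strength theta > 0."""
    if theta > math.sqrt(gamma):
        return (1.0 + theta) * (1.0 + gamma / theta)
    return null_edge(gamma)

gamma = 0.5  # illustrative aspect ratio
for theta in (0.3, math.sqrt(gamma), 1.0, 2.0):
    limit = top_eigenvalue_limit(theta, gamma)
    print(f"theta={theta:.3f}  limit={limit:.4f}  edge={null_edge(gamma):.4f}")

# Critical window: with theta_n = sqrt(gamma) + a * n^(-1/3), the gap between
# the outlier formula and the edge shrinks like n^(-2/3).
a = 1.0
for n in (10**3, 10**6, 10**9):
    theta_n = math.sqrt(gamma) + a * n ** (-1.0 / 3.0)
    gap = top_eigenvalue_limit(theta_n, gamma) - null_edge(gamma)
    print(f"n={n:>10}  gap * n^(2/3) = {gap * n ** (2.0 / 3.0):.4f}")
```

The rescaled gap stays of order one as $n$ grows, which is the sense in which the separation is comparable to edge fluctuations in this regime.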
[example: PCA Detection Threshold for a Single Spike]
Take a sequence with $p/n\to\gamma$ and consider tests that use only the largest sample eigenvalue $\lambda_1(\hat{\Sigma})$. By *Marchenko-Pastur Edge for the Null Model*, under $H_0:\Sigma=I_p$,
\begin{align*}
\lambda_1(\hat{\Sigma}) \to (1+\sqrt{\gamma})^2
\end{align*}
in probability. Under a rank-one alternative with fixed spike strength $\theta>\sqrt{\gamma}$, the BBP transition formula above gives
\begin{align*}
\lambda_1(\hat{\Sigma}) \to (1+\theta)\left(1+\frac{\gamma}{\theta}\right)
\end{align*}
in probability.
The supercritical limit is strictly above the null edge because
\begin{align*}
(1+\theta)\left(1+\frac{\gamma}{\theta}\right)=1+\theta+\frac{\gamma}{\theta}+\gamma.
\end{align*}
Also,
\begin{align*}
(1+\sqrt{\gamma})^2=1+2\sqrt{\gamma}+\gamma.
\end{align*}
Subtracting the null edge from the supercritical limit gives
\begin{align*}
(1+\theta)\left(1+\frac{\gamma}{\theta}\right)-(1+\sqrt{\gamma})^2=\theta+\frac{\gamma}{\theta}-2\sqrt{\gamma}.
\end{align*}
Since $\theta>0$,
\begin{align*}
\theta+\frac{\gamma}{\theta}-2\sqrt{\gamma}=\frac{\theta^2-2\theta\sqrt{\gamma}+\gamma}{\theta}.
\end{align*}
Factoring the numerator gives
\begin{align*}
\frac{\theta^2-2\theta\sqrt{\gamma}+\gamma}{\theta}=\frac{(\theta-\sqrt{\gamma})^2}{\theta}.
\end{align*}
This is positive when $\theta>\sqrt{\gamma}$.
Choose a deterministic threshold $t$ with
\begin{align*}
(1+\sqrt{\gamma})^2<t<(1+\theta)\left(1+\frac{\gamma}{\theta}\right).
\end{align*}
Under $H_0$, convergence in probability to a limit below $t$ implies
\begin{align*}
\mathbb P_0\left(\lambda_1(\hat{\Sigma})>t\right)\to 0.
\end{align*}
Under the supercritical alternative, convergence in probability to a limit above $t$ implies
\begin{align*}
\mathbb P_\theta\left(\lambda_1(\hat{\Sigma})>t\right)\to 1.
\end{align*}
Thus the largest-eigenvalue test is asymptotically reliable for every fixed $\theta>\sqrt{\gamma}$. If $0<\theta\le\sqrt{\gamma}$, the same BBP transition instead gives
\begin{align*}
\lambda_1(\hat{\Sigma})\to (1+\sqrt{\gamma})^2
\end{align*}
under both the null model and the rank-one alternative, so no threshold separated by a fixed positive amount from the Marchenko-Pastur edge can detect a subcritical spike from $\lambda_1(\hat{\Sigma})$ alone.
[/example]
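The algebra in the example can be checked numerically. The sketch below verifies the gap identity on a small grid and returns one admissible threshold, the midpoint between the null edge and the outlier limit; the midpoint is only one convenient choice, and the names `outlier_gap` and `midpoint_threshold` are hypothetical.

```python
import math

def outlier_gap(theta, gamma):
    """Closed form (theta - sqrt(gamma))^2 / theta for the limit gap."""
    return (theta - math.sqrt(gamma)) ** 2 / theta

def midpoint_threshold(theta, gamma):
    """One admissible threshold t: halfway between null edge and outlier limit."""
    if theta <= math.sqrt(gamma):
        raise ValueError("no separating threshold for a subcritical spike")
    edge = (1.0 + math.sqrt(gamma)) ** 2
    outlier = (1.0 + theta) * (1.0 + gamma / theta)
    return 0.5 * (edge + outlier)

# Check the algebraic identity from the example on a small grid.
for gamma in (0.25, 1.0, 4.0):
    for theta in (0.1, 0.5, 1.0, 3.0):
        direct = (1.0 + theta) * (1.0 + gamma / theta) - (1.0 + math.sqrt(gamma)) ** 2
        assert abs(direct - outlier_gap(theta, gamma)) < 1e-9

print(midpoint_threshold(2.0, 1.0))  # halfway between 4 and 4.5
```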
The rank-one example gives the testing intuition, but practical PCA often inspects several leading empirical eigenvalues. This motivates extending the outlier formula to finitely many spikes and asking whether the transition acts independently on each population component.
[quotetheorem:5942]
[citeproof:5942]
This result justifies the practical rule that the observed number of large eigenvalues estimates the number of supercritical components, not necessarily the full population rank. The fixed-rank assumption is part of the conclusion: the finite determinant equation remains low-dimensional, and the noise bulk is not itself reshaped by the spikes. A concrete failure mode is a covariance matrix whose first $\alpha p$ eigenvalues all equal $1+\theta$ for some fixed $\alpha\in(0,1)$. Even if each individual spike is modest, the empirical spectral distribution is then a deformed Marchenko-Pastur law determined by the whole population spectrum, not a standard bulk plus finitely many outliers. Repeated supercritical spike strengths also require interpreting the conclusion at the level of the associated outlier cluster or eigenspace rather than as a canonical labelling of individual eigenvectors. Weak population components can therefore exist while leaving no separated empirical eigenvalue.
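The counting rule can be made concrete with a short sketch. It assumes distinct fixed spike strengths and applies the single-spike outlier formula to each supercritical component; the spike values and the helper name `supercritical_outliers` are hypothetical.

```python
import math

def supercritical_outliers(thetas, gamma):
    """Limiting outlier locations for the spikes strictly above sqrt(gamma)."""
    threshold = math.sqrt(gamma)
    return [(1.0 + t) * (1.0 + gamma / t)
            for t in sorted(thetas, reverse=True) if t > threshold]

thetas = [3.0, 1.2, 0.6, 0.2]   # hypothetical population spike strengths
gamma = 0.5                      # sqrt(gamma) is about 0.707
outliers = supercritical_outliers(thetas, gamma)
print(len(thetas), "population spikes,", len(outliers), "limiting outliers")
print([round(x, 4) for x in outliers])
```

Here the population rank is four, but only two limiting outliers appear; the two weaker components remain inside the bulk.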
## Eigenvector Alignment
Eigenvalue separation answers whether the spectrum contains a visible spike. The next question is whether the leading empirical eigenvector points in the true signal direction, since PCA is used primarily to estimate a subspace. We need a sign-invariant measure of directional recovery.
[definition: Squared Eigenvector Overlap]
In the rank-one spiked covariance model, let $\hat{v}_1$ be a unit eigenvector associated with $\lambda_1(\hat{\Sigma})$. The squared eigenvector overlap is
\begin{align*}
|\langle \hat{v}_1,v\rangle|^2.
\end{align*}
[/definition]
The absolute value removes the sign ambiguity of eigenvectors. A limit of zero means the empirical direction is asymptotically orthogonal to the signal, while a positive limit means PCA recovers nonzero directional information. This motivates a second BBP calculation: above the outlier threshold, how much of the empirical eigenvector is signal rather than noise?
[quotetheorem:5943]
[citeproof:5943]
The overlap formula contains two lessons. At the threshold the alignment turns on continuously from zero, and even far above threshold the overlap is less than one unless the spike strength diverges. The isolated-outlier hypothesis is essential: below the threshold the leading eigenvector is a noise-edge eigenvector, so a fixed signal direction has asymptotically zero overlap, and at repeated or clustered spikes individual empirical eigenvectors are not canonically matched to individual population vectors. The theorem gives a limit in probability under the stated Gaussian proportional model; it does not provide finite-sample confidence intervals, nor does it automatically cover every non-Gaussian distribution without the universality assumptions needed for eigenvector statistics. The following numerical case shows how eigenvalue visibility and eigenvector accuracy can be quite different.
[example: Quantifying PCA Loss Above the Threshold]
Let $\gamma=1$ and $\theta=2$. Since $\sqrt{\gamma}=\sqrt{1}=1$ and $2>1$, the spike is supercritical. By *[Finite-Rank BBP Outlier Location Theorem for Spiked Covariance Matrices](/theorems/5942)*, the limiting outlier location is
\begin{align*}
(1+\theta)\left(1+\frac{\gamma}{\theta}\right)=(1+2)\left(1+\frac{1}{2}\right)=3\cdot\frac{3}{2}=\frac{9}{2}.
\end{align*}
The null upper edge is
\begin{align*}
(1+\sqrt{\gamma})^2=(1+\sqrt{1})^2=(1+1)^2=4.
\end{align*}
Thus the separated eigenvalue lies above the null edge by
\begin{align*}
\frac{9}{2}-4=\frac{9}{2}-\frac{8}{2}=\frac{1}{2}.
\end{align*}
By *Asymptotic Eigenvector Overlap*, the limiting squared overlap is
\begin{align*}
\frac{1-\gamma/\theta^2}{1+\gamma/\theta}=\frac{1-1/2^2}{1+1/2}=\frac{1-1/4}{3/2}=\frac{3/4}{3/2}=\frac{3}{4}\cdot\frac{2}{3}=\frac{1}{2}.
\end{align*}
For unit vectors, the squared sine loss is one minus the squared overlap, so its limit is
\begin{align*}
1-\frac{1}{2}=\frac{1}{2}.
\end{align*}
The empirical first principal component therefore contains nonzero signal, but it still loses half of the squared directional mass in the proportional limit; visible eigenvalue separation is not the same as near-perfect eigenvector estimation.
[/example]
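The arithmetic of this example can be reproduced in exact rational arithmetic. The sketch below uses the overlap formula displayed in the example; the function names are hypothetical, and the final loop illustrates that the limiting overlap approaches one only as the spike strength grows.

```python
from fractions import Fraction

def outlier_limit(theta, gamma):
    """Limiting outlier location (1 + theta)(1 + gamma / theta)."""
    return (1 + theta) * (1 + gamma / theta)

def squared_overlap_limit(theta, gamma):
    """Limiting squared overlap for a supercritical spike, else zero."""
    if theta * theta <= gamma:
        return Fraction(0)
    return (1 - gamma / theta**2) / (1 + gamma / theta)

gamma, theta = Fraction(1), Fraction(2)
print(outlier_limit(theta, gamma))               # exact rational value
print(squared_overlap_limit(theta, gamma))       # limiting squared overlap
print(1 - squared_overlap_limit(theta, gamma))   # limiting squared sine loss

# The overlap stays below one for every finite spike strength.
for t in (2, 4, 16, 256):
    print(t, float(squared_overlap_limit(Fraction(t), gamma)))
```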
For several spikes, eigenvectors estimate the supercritical signal subspace rather than all population directions. If spikes are repeated, the identifiable object is the corresponding subspace, since individual eigenvectors inside an equal-eigenvalue population subspace are not uniquely defined. This motivates measuring recovery by projections rather than by a chosen basis.
[remark: Subspace Rather Than Coordinate Recovery]
In a finite-rank model with a cluster of equal supercritical spike strengths, the empirical eigenspace associated with the corresponding outliers aligns with the population spike subspace. Individual empirical eigenvectors may rotate inside that subspace. The statistically meaningful target is then a projection matrix such as $VV^\top$, where the columns of $V$ form an [orthonormal basis](/page/Orthonormal%20Basis) for the signal subspace.
[/remark]
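A small numerical check illustrates why the projection is the right target. In the sketch below, a hypothetical two-dimensional signal subspace of $\mathbb R^3$ is described by two different orthonormal bases related by a rotation; the bases differ entrywise, while the projection matrices $VV^\top$ agree. The matrix helpers are written out by hand only to keep the example self-contained.

```python
import math

def matmul(a, b):
    """Plain matrix product for lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(a):
    return [list(col) for col in zip(*a)]

def projection(v):
    """Projection matrix V V^T for a matrix V with orthonormal columns."""
    return matmul(v, transpose(v))

# Hypothetical 2-dimensional signal subspace inside R^3.
V = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
angle = 0.7
R = [[math.cos(angle), -math.sin(angle)],
     [math.sin(angle),  math.cos(angle)]]
V_rot = matmul(V, R)   # another orthonormal basis of the same subspace

max_basis_diff = max(abs(V[i][j] - V_rot[i][j]) for i in range(3) for j in range(2))
max_proj_diff = max(abs(projection(V)[i][j] - projection(V_rot)[i][j])
                    for i in range(3) for j in range(3))
print(max_basis_diff, max_proj_diff)
```

A loss that compares basis vectors would penalise the rotation, whereas a loss built from the projections does not.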
This distinction will reappear in minimax estimation of low-rank covariance matrices, where losses based on projection matrices are better behaved than losses depending on arbitrary choices of basis. It also marks a change in the chapter's viewpoint. So far the questions have been spectral and estimational: which eigenvalues separate, and how much signal is present in the associated eigenspaces? The next section asks a sharper testing question: even when PCA fails to recover a direction, could some other statistic still distinguish the spiked model from pure noise?
## Detection, Contiguity, and Likelihood Ratios
PCA gives a concrete test, but it does not settle the full statistical detection problem. The next problem is to compare spectral thresholds with information-theoretic thresholds derived from likelihood ratios and contiguity. We start by separating the null model from the composite alternative.
[definition: Spiked Covariance Detection Problem]
Fix $p,n\in\mathbb N$ and a spike strength $\theta>0$. The rank-one spiked covariance detection problem is the hypothesis test
\begin{align*}
H_0: X_1,\dots,X_n\overset{\mathrm{i.i.d.}}{\sim}\mathcal N(0,I_p).
\end{align*}
The alternative is
\begin{align*}
H_1: X_1,\dots,X_n\overset{\mathrm{i.i.d.}}{\sim}\mathcal N(0,I_p+\theta vv^\top),
\end{align*}
where $v\in\mathbb R^p$ is a unit vector, $\|v\|_2=1$, and whether $v$ is known, fixed but unknown, or drawn from a prior depends on the model class.
[/definition]
When $v$ is known, detection is a one-dimensional variance test and any fixed $\theta>0$ is detectable as $n\to\infty$. The BBP threshold concerns the more difficult composite problem in which the signal direction is unknown and the statistician must search over many possible directions. This motivates writing the optimal testing problem through a likelihood ratio.
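The known-direction case can be simulated without generating $p$-dimensional data, since $\langle v,X_i\rangle$ is distributed as $\mathcal N(0,1)$ under $H_0$ and as $\mathcal N(0,1+\theta)$ under $H_1$. The sketch below is an illustration under that reduction; the threshold $1+\theta/2$, the sample size, and the seed are arbitrary choices rather than an optimal design.

```python
import math
import random

def projected_variance(n, theta, rng):
    """Average of <v, X_i>^2 when <v, X_i> ~ N(0, 1 + theta)."""
    sd = math.sqrt(1.0 + theta)
    return sum(rng.gauss(0.0, sd) ** 2 for _ in range(n)) / n

rng = random.Random(0)       # fixed seed for reproducibility
theta, n = 0.5, 20000
t = 1.0 + theta / 2.0        # threshold between the variances 1 and 1 + theta

stat_null = projected_variance(n, 0.0, rng)
stat_alt = projected_variance(n, theta, rng)
print(f"null: {stat_null:.3f}  alternative: {stat_alt:.3f}  threshold: {t:.3f}")
print("reject under null:", stat_null > t, " reject under alternative:", stat_alt > t)
```

The dimension $p$ never enters this test, which is why no aspect-ratio threshold appears when the direction is known.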
[definition: Likelihood Ratio]
Let $P_0$ be the joint distribution of the data under $H_0$ and let $P_1$ be the joint distribution under a specified alternative distribution. If $P_1\ll P_0$, the likelihood ratio is
\begin{align*}
L:\Omega\to[0,\infty), \qquad L = \frac{dP_1}{dP_0},
\end{align*}
where $(\Omega,\mathcal F)$ is the underlying sample space.
[/definition]
Likelihood ratios connect testing to moment calculations. If $L$ remains bounded in $L^2(P_0)$ along an asymptotic sequence, the alternative cannot be separated from the null with vanishing total error by any test. To express this impossibility at the level of all tests, we use contiguity.
[definition: Contiguity]
Let $(P_n)$ and $(Q_n)$ be sequences of probability measures on measurable spaces $(\Omega_n,\mathcal F_n)$. The sequence $(Q_n)$ is contiguous with respect to $(P_n)$, written $Q_n\triangleleft P_n$, if for every sequence of events $A_n\in\mathcal F_n$,
\begin{align*}
P_n(A_n)\to 0 \quad \implies \quad Q_n(A_n)\to 0.
\end{align*}
The sequences are mutually contiguous if $Q_n\triangleleft P_n$ and $P_n\triangleleft Q_n$.
[/definition]
Contiguity is stronger than saying a particular statistic fails. It says that no sequence of events that is asymptotically negligible under the null can keep non-vanishing probability under the alternative, so asymptotic detection with vanishing total error is impossible. This motivates a usable sufficient condition based on the second moment of the likelihood ratio.
[quotetheorem:5944]
[citeproof:5944]
The second-moment criterion is only a sufficient condition for contiguity, not a necessary one. If the second moment diverges, the null and alternative may still be contiguous, and a sharper likelihood-ratio argument may be needed. A standard mechanism is lack of uniform integrability: in some critical planted models, rare samples with huge likelihood ratio make $\mathbb E_{P_n}[L_n^2]$ diverge, while the likelihood ratio itself still has a tight limiting distribution and the measures remain mutually contiguous by [Le Cam's first lemma](/theorems/5951).
The hypothesis $Q_n\ll P_n$ is part of the mechanism, not a technical decoration. It ensures that $L_n=dQ_n/dP_n$ exists and that every $Q_n$-probability can be rewritten as a $P_n$-expectation against $L_n$. Without absolute continuity, there may be events with $P_n(A_n)=0$ but $Q_n(A_n)>0$; rejecting on such an event never produces a false alarm, and if $Q_n(A_n)$ stays bounded away from zero then contiguity of $Q_n$ with respect to $P_n$ fails before any second-moment calculation can begin. Thus the criterion applies only after the alternative has been expressed on the same null support, often by integrating a planted model over a prior.
Its strength is that it converts an all-tests impossibility statement into a concrete expectation under the null, so it is well matched to planted models where the alternative is averaged over a prior. For rotationally invariant priors on $v$, the likelihood ratio averages over all directions, and the key quantity becomes the random overlap between two independent spike directions. This motivates comparing the second-moment lower bound with the spectral BBP threshold in the spherical Gaussian model.
[quotetheorem:5945]
[citeproof:5945]
The theorem shows that for the spherical rank-one Gaussian model, PCA is not merely a convenient algorithm: its threshold matches the information-theoretic threshold. The spherical dense prior is doing real work because two independent spike directions have very small overlap, which is exactly what keeps the likelihood-ratio second moment controlled below the BBP threshold. The boundary case $\theta=\sqrt{\gamma}$ is not resolved by the displayed dichotomy; critical scaling requires edge fluctuation and refined likelihood-ratio analysis. The result also should not be read as a universal statement about all spike priors or all noise distributions: sparse spikes, structured directions, and non-Gaussian observations can have different information-theoretic and computational thresholds. The contrast motivates the following comparison, which previews computational-statistical gaps later in the course.
[example: Spectral and Information-Theoretic Thresholds]
In the spherical prior rank-one model with $p/n\to\gamma$, the spectral and information-theoretic thresholds coincide at $\theta=\sqrt{\gamma}$. First suppose $\theta>\sqrt{\gamma}$. Under the null model, *Marchenko-Pastur Edge for the Null Model* gives
\begin{align*}
\lambda_1(\hat{\Sigma})\to (1+\sqrt{\gamma})^2
\end{align*}
in probability, while under the planted spherical alternative, the BBP transition formula above gives
\begin{align*}
\lambda_1(\hat{\Sigma})\to (1+\theta)\left(1+\frac{\gamma}{\theta}\right)
\end{align*}
in probability. Expanding the supercritical limit gives
\begin{align*}
(1+\theta)\left(1+\frac{\gamma}{\theta}\right)=1+\theta+\frac{\gamma}{\theta}+\gamma.
\end{align*}
The null edge expands as
\begin{align*}
(1+\sqrt{\gamma})^2=1+2\sqrt{\gamma}+\gamma.
\end{align*}
Therefore the gap between the planted limit and the null limit is
\begin{align*}
(1+\theta)\left(1+\frac{\gamma}{\theta}\right)-(1+\sqrt{\gamma})^2=\theta+\frac{\gamma}{\theta}-2\sqrt{\gamma}.
\end{align*}
Since $\theta>0$, this can be written over the common denominator $\theta$:
\begin{align*}
\theta+\frac{\gamma}{\theta}-2\sqrt{\gamma}=\frac{\theta^2+\gamma-2\theta\sqrt{\gamma}}{\theta}.
\end{align*}
Factoring the numerator gives
\begin{align*}
\frac{\theta^2+\gamma-2\theta\sqrt{\gamma}}{\theta}=\frac{(\theta-\sqrt{\gamma})^2}{\theta}.
\end{align*}
This quantity is positive when $\theta>\sqrt{\gamma}$. Hence one can choose a deterministic threshold $t$ satisfying
\begin{align*}
(1+\sqrt{\gamma})^2<t<(1+\theta)\left(1+\frac{\gamma}{\theta}\right).
\end{align*}
For the test that rejects when $\lambda_1(\hat{\Sigma})>t$, convergence in probability below $t$ under the null gives null rejection probability tending to $0$, and convergence in probability above $t$ under the planted alternative gives power tending to $1$.
Now suppose $0<\theta<\sqrt{\gamma}$. By the spherical-spike detection threshold theorem, the planted spherical distribution is contiguous with respect to the null distribution. If a sequence of tests has null rejection probability tending to $0$, and $A_n$ is its rejection event, then $P_0(A_n)\to 0$. Contiguity implies $P_1(A_n)\to 0$, so the same tests cannot have power tending to $1$. Thus below $\sqrt{\gamma}$ no asymptotically powerful test exists in the spherical dense model, while above $\sqrt{\gamma}$ the largest eigenvalue already succeeds. In sparse spiked covariance models the alternative is restricted to a smaller structured class of directions, so likelihood-ratio methods can sometimes detect below the spectral threshold; this is the basic mechanism behind later computational-statistical gaps.
[/example]
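The overlap mechanism behind the spherical-prior threshold can be explored numerically. The sketch below rests on two facts that are not stated in the text above and should be read as labelled assumptions of the illustration: for fixed unit directions with overlap $\rho$ and $\theta|\rho|<1$, a Gaussian integral gives $\mathbb E_0[L_vL_{v'}]=(1-\theta^2\rho^2)^{-n/2}$, and the overlap of two independent uniform unit vectors in $\mathbb R^p$ has density proportional to $(1-\rho^2)^{(p-3)/2}$ on $[-1,1]$. The comparison value $(1-\theta^2/\gamma)^{-1/2}$ is a heuristic large-$p$ approximation, not a quoted theorem. The code is restricted to $\theta<1$, because for $\theta\ge 1$ the pair formula is infinite on overlaps with $\theta|\rho|\ge 1$ and a truncated argument would be needed. All function names are hypothetical.

```python
import math

def pair_second_moment(theta, rho, n):
    """E_0[L_v L_v'] = (1 - theta^2 rho^2)^(-n/2); requires theta * |rho| < 1."""
    return (1.0 - (theta * rho) ** 2) ** (-n / 2.0)

def spherical_second_moment(theta, p, n, steps=20000):
    """Average the pair formula over the overlap of two uniform unit vectors.

    Simpson's rule on [-1, 1] in log-space; only handles 0 < theta < 1, p > 3.
    """
    if not 0.0 < theta < 1.0 or p <= 3 or steps % 2:
        raise ValueError("need 0 < theta < 1, p > 3, and an even number of steps")
    # Normalising constant of the density c_p * (1 - rho^2)^((p - 3) / 2).
    log_c = math.lgamma(p / 2.0) - math.lgamma((p - 1) / 2.0) - 0.5 * math.log(math.pi)
    total = 0.0
    for i in range(1, steps):          # the integrand vanishes at rho = -1 and rho = 1
        rho = -1.0 + 2.0 * i / steps
        log_term = (log_c + 0.5 * (p - 3) * math.log(1.0 - rho * rho)
                    - 0.5 * n * math.log(1.0 - (theta * rho) ** 2))
        weight = 4 if i % 2 else 2     # interior Simpson weights
        total += weight * math.exp(log_term)
    return total * (2.0 / steps) / 3.0

def heuristic_limit(theta, gamma):
    """Large-p approximation (1 - theta^2 / gamma)^(-1/2) for theta < sqrt(gamma)."""
    return (1.0 - theta ** 2 / gamma) ** -0.5

p, n = 200, 400            # hypothetical sizes with p / n = 0.5
print(pair_second_moment(0.4, 1.0 / math.sqrt(p), n))   # pair term at a typical overlap
for theta in (0.2, 0.4, 0.6):
    print(theta, spherical_second_moment(theta, p, n), heuristic_limit(theta, p / n))
```

In this range the averaged second moment stays bounded and grows as $\theta$ approaches $\sqrt{\gamma}$, which is consistent with the contiguity statement used in the example.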
## Johnstone's Spiked Covariance Model and Statistical Interpretation
The last problem is to connect the asymptotic random-matrix statements to the statistical model introduced by Johnstone for high-dimensional PCA. This formulation emphasises principal component analysis as an estimator of covariance structure rather than only a spectral perturbation problem. We therefore reparametrise spike strengths in terms of the population eigenvalues themselves.
[definition: Johnstone Spiked Covariance Model]
Johnstone's spiked covariance model is the Gaussian model
\begin{align*}
X_1,\dots,X_n \overset{\mathrm{i.i.d.}}{\sim} \mathcal N(0,\Sigma_p),
\end{align*}
where $\Sigma_p$ has eigenvalues
\begin{align*}
\ell_1\ge \cdots \ge \ell_r>1, \qquad \ell_{r+1}=\cdots=\ell_p=1,
\end{align*}
with fixed $r$ as $p,n\to\infty$ and aspect ratio tending to $\gamma\in(0,\infty)$.
[/definition]
This is the same finite-rank covariance model under the reparametrisation $\ell_j=1+\theta_j$. Johnstone's viewpoint made the model central in statistics because it explains why classical PCA intuition fails when dimension and sample size grow together. In the notation of the BBP theorem quoted above, the separation condition becomes $\ell_j>1+\sqrt\gamma$: a population eigenvalue must exceed the noise level by more than the square root of the aspect ratio before it is consistently visible as a separated empirical component. The fixed-rank and proportional-asymptotic hypotheses are still essential; the result describes a small number of principal components against a Marchenko-Pastur bulk, not arbitrary high-rank covariance estimation. It also guarantees separation of certain sample eigenvalues, not consistent recovery of the entire covariance matrix or accurate estimation of subcritical components. The next example translates the inequality into a sample-size requirement.
[example: Sample Size Needed for Eigenvalue Separation]
Suppose the population covariance has one spike eigenvalue, $\ell=1.5$ in Johnstone's notation. In the spike-strength notation of this chapter, the corresponding spike strength is
\begin{align*}
\theta=\ell-1=1.5-1=0.5.
\end{align*}
By the same BBP threshold, this population eigenvalue produces a separated sample eigenvalue exactly when
\begin{align*}
\ell>1+\sqrt{\gamma}.
\end{align*}
Substituting $\ell=1.5$ gives
\begin{align*}
1.5>1+\sqrt{\gamma}.
\end{align*}
Subtracting $1$ from both sides gives
\begin{align*}
0.5>\sqrt{\gamma}.
\end{align*}
Both sides are nonnegative, so squaring preserves the inequality:
\begin{align*}
(0.5)^2>\gamma.
\end{align*}
Since
\begin{align*}
(0.5)^2=0.25,
\end{align*}
the separation condition is
\begin{align*}
\gamma<0.25.
\end{align*}
Because $\gamma$ is the limit of $p/n$ in the proportional regime, for large $n$ this means
\begin{align*}
\frac{p}{n}<\frac{1}{4}.
\end{align*}
Multiplying both sides by the positive number $4n$ gives
\begin{align*}
4p<n,
\end{align*}
so the sample size must satisfy $n>4p$ asymptotically for this spike to emerge as an outlier.
For comparison, if the sample size is only twice the dimension, then
\begin{align*}
n=2p, \qquad \gamma=\frac{p}{n}=\frac{p}{2p}=\frac{1}{2}.
\end{align*}
The threshold is then
\begin{align*}
\sqrt{\gamma}=\sqrt{\frac{1}{2}}.
\end{align*}
To compare $0.5$ with $\sqrt{1/2}$, square both nonnegative quantities:
\begin{align*}
(0.5)^2=\frac{1}{4}, \qquad \left(\sqrt{\frac{1}{2}}\right)^2=\frac{1}{2}.
\end{align*}
Since
\begin{align*}
\frac{1}{4}<\frac{1}{2},
\end{align*}
we have
\begin{align*}
0.5<\sqrt{\frac{1}{2}}.
\end{align*}
Thus the spike is below the BBP threshold when $n=2p$, even though the population covariance has eigengap
\begin{align*}
\ell-1=1.5-1=0.5.
\end{align*}
This shows that a genuine population principal component need not be visible as a separated empirical eigenvalue unless the sample size is large enough relative to the dimension.
[/example]
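The same calculation can be packaged as a small helper. The sketch below treats the finite ratio $p/n$ as a proxy for the limiting aspect ratio $\gamma$, which is an approximation rather than a finite-sample guarantee; the function names and the sizes $p=1000$ are hypothetical.

```python
import math

def is_supercritical(ell, p, n):
    """BBP check ell > 1 + sqrt(p / n), using p / n as a proxy for gamma."""
    return ell > 1.0 + math.sqrt(p / n)

def min_samples_per_dimension(ell):
    """Limiting ratio n / p that must be exceeded: 1 / (ell - 1)^2."""
    if ell <= 1.0:
        raise ValueError("a spike needs ell > 1")
    return 1.0 / (ell - 1.0) ** 2

print(min_samples_per_dimension(1.5))          # ratio n / p that must be exceeded
print(is_supercritical(1.5, p=1000, n=2000))   # n = 2p, subcritical
print(is_supercritical(1.5, p=1000, n=5000))   # n = 5p, supercritical
```

Because the required ratio scales like $1/(\ell-1)^2$, halving the eigengap quadruples the number of samples needed per dimension.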
The chapter's conclusions can now be summarised in statistical language. Supercritical spikes are detectable by PCA and estimable with nonzero eigenvector overlap; subcritical dense spikes are hidden by the Marchenko-Pastur edge and are contiguous to the null under the spherical prior. This transition is one of the main examples in high-dimensional statistics where minimax lower bounds, likelihood-ratio calculations, and random matrix limits describe the same boundary. The same spike-versus-noise competition also appears in signal processing array models, factor analysis in econometrics, and community detection for random graphs, where low-rank structure must be separated from a high-dimensional random background.
Spiked covariance models make the transition from estimation to detection especially vivid, since the same threshold controls both estimation accuracy and statistical separability. We now treat detection itself as the primary object, using likelihood ratios and contiguity to understand when planted structure is invisible to any test.
# 10. Minimax Testing, Detection, and Contiguity
This chapter turns minimax lower bounds into testing problems in their own right. Chapters 1 through 3 used testing as a reduction tool for estimation, and Chapter 9 used likelihood ratios to compare spiked and null covariance models; here the testing problem is the object of study, and the central question is whether any procedure can reliably distinguish a null model from a structured alternative. The main techniques are total variation, chi-square divergence, second moments, and contiguity, all of which convert high-dimensional geometry into quantitative impossibility statements.
## Distinguishing Null and Alternative Models
A testing problem begins with two statistical models and asks whether the observed data contain enough information to separate them. In high dimension the alternative is often composite: the signal might be sparse, low-rank, or planted in an unknown direction, so the first task is to say what counts as a decision rule.
[definition: Test Between Two Models]
Let $(\mathcal X, \mathcal A)$ be a measurable space. A test between two families of probability measures $\mathcal P_0$ and $\mathcal P_1$ on $(\mathcal X, \mathcal A)$ is a measurable map $\phi: \mathcal X \to \{0,1\}$.
[/definition]
The convention is that $\phi=1$ rejects the null model. Once tests are formal objects, the next question is how to compare them, and for minimax theory the comparison must record both false rejection under the null and failure to reject under the alternative.
[definition: Testing Risk]
Let $\mathsf T(\mathcal X,\mathcal A)$ be the set of all measurable maps $\phi:\mathcal X\to\{0,1\}$. The testing-risk functional is the map
\begin{align*}
R(\cdot;\mathcal P_0,\mathcal P_1):\mathsf T(\mathcal X,\mathcal A)\to[0,2]
\end{align*}
defined by
\begin{align*}
R(\phi;\mathcal P_0,\mathcal P_1)
= \sup_{P\in \mathcal P_0} P(\phi=1)
+ \sup_{Q\in \mathcal P_1} Q(\phi=0).
\end{align*}
The minimax testing-risk functional is the map
\begin{align*}
R^*:\{(\mathcal P_0,\mathcal P_1):\mathcal P_0,\mathcal P_1\subseteq\mathcal P(\mathcal X,\mathcal A)\}\to[0,2]
\end{align*}
defined by
\begin{align*}
R^*(\mathcal P_0,\mathcal P_1)=\inf_\phi R(\phi;\mathcal P_0,\mathcal P_1),
\end{align*}
where the infimum is over all measurable tests.
[/definition]
Small minimax risk means reliable detection is possible; risk close to $1$ means that even the best test does essentially no better than a trivial rule that ignores the data, such as never rejecting, whose risk is exactly $1$. In the simple null versus simple alternative case, this risk is controlled by the largest difference that the two distributions assign to a single event, so we need a metric built from events.
[definition: Total Variation Distance]
The total variation distance is the map
\begin{align*}
\operatorname{TV}:\mathcal P(\mathcal X,\mathcal A)\times\mathcal P(\mathcal X,\mathcal A)\to[0,1]
\end{align*}
defined by
\begin{align*}
\operatorname{TV}(P,Q)=\sup_{A\in\mathcal A}|P(A)-Q(A)|.
\end{align*}
[/definition]
Total variation measures the largest possible separation of probabilities assigned to the same event. In a binary testing problem, every deterministic test is the indicator of some rejection event, so the best possible test can only exploit the event where the two measures disagree most.
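On a finite sample space the supremum over events can be computed by brute force. The sketch below does so for two hypothetical four-point distributions and compares the result with the half-$L^1$ formula, a standard equivalent expression on countable spaces that is not derived in this section.

```python
from itertools import combinations

def tv_by_events(p, q):
    """Brute-force sup over all events A of |P(A) - Q(A)| on a finite space."""
    points = list(p)
    best = 0.0
    for r in range(len(points) + 1):
        for event in combinations(points, r):
            diff = abs(sum(p[x] for x in event) - sum(q[x] for x in event))
            best = max(best, diff)
    return best

def tv_half_l1(p, q):
    """Equivalent closed form on a finite space: half the L1 distance."""
    return 0.5 * sum(abs(p[x] - q[x]) for x in p)

# Hypothetical distributions on a four-point space.
P = {"a": 0.4, "b": 0.3, "c": 0.2, "d": 0.1}
Q = {"a": 0.1, "b": 0.3, "c": 0.2, "d": 0.4}
print(tv_by_events(P, Q), tv_half_l1(P, Q))
```

The maximising event is the set of points where one distribution puts more mass than the other, here the single point `"a"` or equivalently `"d"`.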
The basic obstruction is therefore geometric: if no measurable event separates $P$ from $Q$ by much, then every rejection rule must either reject often under the null or fail to reject often under the alternative. Simple versus simple testing is the setting where this obstruction can be expressed exactly in terms of the total variation distance.
[quotetheorem:5946]
[citeproof:5946]
The formula uses two probability measures on the same measurable space and tests that are measurable with respect to that shared $\sigma$-algebra. This common-space assumption is statistical, not just formal: a test designed for a Gaussian vector in $\mathbb R^d$ cannot be compared directly with a test designed for an unordered spectrum unless both models have first been pushed forward to the same observation space. At the extremes, if $P=Q$ then $\operatorname{TV}(P,Q)=0$ and the optimal risk is $1$, while if $P$ and $Q$ are mutually singular then $\operatorname{TV}(P,Q)=1$ and a measurable separating event gives zero risk. If the proposed rejection region is not measurable, the probabilities in the risk are undefined; for instance a nonmeasurable subset $A$ of $[0,1]$ cannot be used as $\mathbb{1}_A$ under [Lebesgue measure](/page/Lebesgue%20Measure). The theorem also does not solve composite testing, because replacing a class $\{Q_\theta\}$ by one favourable member can miss the hard average behaviour of the class. This is why the next step is not to select one signal, but to average many signals into a distribution whose likelihood ratio reflects the geometry of the whole class.
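The relation between optimal risk and total variation can be verified exhaustively on a small space. The sketch below assumes the theorem takes the standard form $R^*=1-\operatorname{TV}(P,Q)$, which matches the two extreme cases just described; the distributions are hypothetical, and the test that rejects where $Q$ puts more mass than $P$ is used as the candidate optimum.

```python
from itertools import product

def risk(test, p, q):
    """Sum of errors P(phi = 1) + Q(phi = 0) for a test given as a dict x -> 0 or 1."""
    type_one = sum(p[x] for x in p if test[x] == 1)
    type_two = sum(q[x] for x in q if test[x] == 0)
    return type_one + type_two

def minimax_risk_bruteforce(p, q):
    """Minimum risk over all 2^|X| deterministic tests."""
    points = list(p)
    return min(risk(dict(zip(points, bits)), p, q)
               for bits in product((0, 1), repeat=len(points)))

P = {"a": 0.4, "b": 0.3, "c": 0.2, "d": 0.1}
Q = {"a": 0.1, "b": 0.3, "c": 0.2, "d": 0.4}
tv = 0.5 * sum(abs(P[x] - Q[x]) for x in P)

# Reject exactly where the alternative assigns more mass than the null.
likelihood_ratio_test = {x: 1 if Q[x] > P[x] else 0 for x in P}
print(minimax_risk_bruteforce(P, Q), risk(likelihood_ratio_test, P, Q), 1.0 - tv)
```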
The formula determines the optimal risk exactly once the alternative is a single distribution. For a composite alternative, the averaged distribution must be built so that any lower bound for it remains a lower bound for uniform testing over the whole class.
[definition: Mixture Alternative]
Let $(\Theta,\mathcal T)$ be a measurable parameter space, let $\pi$ be a probability measure on $(\Theta,\mathcal T)$, and let $\theta\mapsto Q_\theta$ be a probability kernel from $(\Theta,\mathcal T)$ to $(\mathcal X,\mathcal A)$. The mixture alternative is the probability measure $Q_\pi$ on $(\mathcal X,\mathcal A)$ defined by
\begin{align*}
Q_\pi(A)=\int_\Theta Q_\theta(A)\,d\pi(\theta),\qquad A\in\mathcal A.
\end{align*}
[/definition]
The mixture construction is where high-dimensional combinatorics enters: a large alternative class may be hard to test because the average signal washes out under the null. This motivates the following theorem, which justifies replacing a composite alternative by a prior-averaged alternative when proving lower bounds.
[quotetheorem:5947]
[citeproof:5947]
The prior in this theorem is not a Bayesian modelling assumption; it is an adversarial device for proving a frequentist lower bound. The hypothesis that $Q_\pi$ is a genuine mixture matters: without a measurable kernel $\theta\mapsto Q_\theta$, the averaged quantities $Q_\pi(A)$ need not define a probability measure, so the total-variation comparison has no target distribution. The theorem also does not say that every prior gives a sharp lower bound; for instance, putting all prior mass on one very strong sparse signal may produce a mixture far from the null even when most alternatives near the boundary are hard. In applications the prior must therefore be chosen to match the hard part of the alternative, such as uniformly random supports in sparse testing or uniformly random directions in rotationally invariant matrix models.
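A toy finite example shows the averaging effect. In the sketch below, each hypothetical alternative $Q_\theta$ moves extra mass onto one of four points, and the uniform prior produces a mixture that coincides with the uniform null. Every individual alternative is at total variation $0.3$ from the null, yet the composite problem has minimax risk $1$, matching the mixture bound $1-\operatorname{TV}(P,Q_\pi)$. The construction is an illustration, not a model from the text.

```python
from itertools import product

points = [0, 1, 2, 3]
P = {x: 0.25 for x in points}                       # null: uniform
# Hypothetical alternatives: Q_theta moves extra mass onto the point theta.
alternatives = {t: {x: (0.55 if x == t else 0.15) for x in points} for t in points}
prior = {t: 0.25 for t in points}                   # uniform prior on theta

mixture = {x: sum(prior[t] * alternatives[t][x] for t in points) for x in points}

def composite_risk(test):
    """Type I error plus worst-case type II error over the alternative class."""
    type_one = sum(P[x] for x in points if test[x] == 1)
    worst_type_two = max(sum(q[x] for x in points if test[x] == 0)
                         for q in alternatives.values())
    return type_one + worst_type_two

def mixture_risk(test):
    """Type I error plus type II error under the prior-averaged alternative."""
    type_one = sum(P[x] for x in points if test[x] == 1)
    return type_one + sum(mixture[x] for x in points if test[x] == 0)

tests = [dict(zip(points, bits)) for bits in product((0, 1), repeat=len(points))]
tv_mixture = 0.5 * sum(abs(P[x] - mixture[x]) for x in points)
print("mixture equals null:", all(abs(mixture[x] - P[x]) < 1e-12 for x in points))
print("best composite risk:", min(composite_risk(t) for t in tests))
print("lower bound 1 - TV(P, Q_pi):", 1.0 - tv_mixture)
```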
This reduction says that a hard average alternative is enough to prove a hard composite problem. A first concrete instance is sparse Gaussian detection, where averaging over unknown supports produces a distribution whose signal location is hidden.
[example: Sparse Gaussian Mixture Null Comparison]
Let $P=\mathcal N(0,I_d)$. For each support $S\subseteq\{1,\dots,d\}$ with $|S|=s$, write $Q_S=\mathcal N(a u_S,I_d)$, where $(u_S)_i=\mathbb 1_{\{i\in S\}}$. The density ratio of $Q_S$ with respect to $P$ is obtained by expanding the two Gaussian exponents:
\begin{align*}
L_S(x)=\frac{dQ_S}{dP}(x)=\exp\left(-\frac12\|x-a u_S\|_2^2+\frac12\|x\|_2^2\right).
\end{align*}
Since $\|x-a u_S\|_2^2=\|x\|_2^2-2a\sum_{i\in S}x_i+a^2\|u_S\|_2^2$ and $\|u_S\|_2^2=s$, this becomes
\begin{align*}
L_S(x)=\exp\left(a\sum_{i\in S}x_i-\frac{s a^2}{2}\right).
\end{align*}
If $\pi$ is uniform over the $\binom{d}{s}$ supports of size $s$, then the mixture likelihood ratio is therefore
\begin{align*}
L_\pi(x)=\frac{dQ_\pi}{dP}(x)=\binom{d}{s}^{-1}\sum_{|S|=s}\exp\left(a\sum_{i\in S}x_i-\frac{s a^2}{2}\right).
\end{align*}
By *Mixture Lower Bound*, every test for the full sparse alternative has risk at least $1-\operatorname{TV}(P,Q_\pi)$. To control this total variation through the second moment, let $S$ and $S'$ be independent uniform supports of size $s$. Then
\begin{align*}
\mathbb E_P[L_\pi(X)^2]=\mathbb E_{S,S'}\mathbb E_P[L_S(X)L_{S'}(X)].
\end{align*}
For fixed $S,S'$, multiplying the two likelihood ratios gives
\begin{align*}
L_S(X)L_{S'}(X)=\exp\left(a\sum_{i\in S}X_i+a\sum_{i\in S'}X_i-sa^2\right).
\end{align*}
Put $k=|S\cap S'|$. The coefficient of $X_i$ in the exponent is $2a$ for $i\in S\cap S'$, is $a$ for $i\in S\triangle S'$, and is $0$ otherwise. Since $|S\triangle S'|=2s-2k$, the standard Gaussian moment formula $\mathbb E\exp(tZ)=\exp(t^2/2)$ for $Z\sim\mathcal N(0,1)$ gives
\begin{align*}
\mathbb E_P[L_S(X)L_{S'}(X)]=\exp\left(\frac12\left(4a^2k+a^2(2s-2k)\right)-sa^2\right).
\end{align*}
The exponent simplifies as $\frac12(4a^2k+2a^2s-2a^2k)-sa^2=a^2k$, so
\begin{align*}
\mathbb E_P[L_S(X)L_{S'}(X)]=\exp\left(a^2|S\cap S'|\right).
\end{align*}
Thus
\begin{align*}
\chi^2(Q_\pi\|P)=\mathbb E_P[L_\pi(X)^2]-1=\mathbb E_{S,S'}\exp\left(a^2|S\cap S'|\right)-1.
\end{align*}
By *[Total Variation Bound by Chi-Square Divergence](/theorems/5948)*, if this overlap moment is close to $1$, then $\operatorname{TV}(P,Q_\pi)$ is small. The sparse testing lower bound is therefore reduced to understanding the random overlap of two independently planted supports.
[/example]
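The overlap $|S\cap S'|$ of two independent uniform supports of size $s$ is hypergeometric, so the overlap moment in the example can be evaluated exactly for moderate $d$ and $s$. The following Python sketch does this; the function names `overlap_pmf` and `chi_square_sparse_mixture` are illustrative and not part of the course material.

```python
from math import comb, exp

def overlap_pmf(d, s):
    """P(|S ∩ S'| = k), k = 0..s, for independent uniform size-s subsets of {1..d}."""
    total = comb(d, s)
    # Hypergeometric law: choose k of the s elements of S, and s - k outside S.
    return [comb(s, k) * comb(d - s, s - k) / total for k in range(s + 1)]

def chi_square_sparse_mixture(d, s, a):
    """chi^2(Q_pi || P) = E exp(a^2 |S ∩ S'|) - 1 for the uniform-support mixture."""
    pmf = overlap_pmf(d, s)
    return sum(p * exp(a * a * k) for k, p in enumerate(pmf)) - 1.0

if __name__ == "__main__":
    d, s = 10_000, 20
    for a in (0.5, 1.0, 2.0, 3.0):
        print(f"a={a}: chi-square = {chi_square_sparse_mixture(d, s, a):.4g}")
```

Evaluating the function over a grid of amplitudes $a$ shows directly where the overlap moment stops being close to $1$ for a given pair $(d,s)$.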
## Chi-Square Bounds and the Second Moment Method
Total variation is conceptually exact, but it can be difficult to compute for high-dimensional mixtures. The chi-square divergence gives a tractable upper bound on total variation and turns the testing problem into a second moment under the null.
[definition: Chi-Square Divergence]
Let
\begin{align*}
\mathcal D_{\chi^2}(\mathcal X,\mathcal A)=\{(Q,P)\in\mathcal P(\mathcal X,\mathcal A)^2:Q\ll P\}.
\end{align*}
The chi-square divergence is the map
\begin{align*}
\chi^2:\mathcal D_{\chi^2}(\mathcal X,\mathcal A)\to[0,\infty]
\end{align*}
defined by
\begin{align*}
\chi^2(Q\|P)=\int \left(\frac{dQ}{dP}-1\right)^2\,dP.
\end{align*}
[/definition]
The likelihood ratio $L=dQ/dP$ has $\mathbb E_P[L]=1$, so $\chi^2(Q\|P)=\mathbb E_P[L^2]-1$. Total variation remains the quantity that governs testing; chi-square enters as a computable surrogate for it.
The surrogate works as a second-moment obstruction: if the likelihood ratio has little $L^2(P)$ fluctuation, then it cannot concentrate enough mass on a small null-probability event to create a powerful test. This gives a computable route from likelihood-ratio calculations to testing lower bounds, provided the second moment can be compared back to total variation.
[quotetheorem:5948]
[citeproof:5948]
The absolute-continuity hypothesis is needed because the likelihood ratio $dQ/dP$ is the object being squared. If $Q$ has a singular component, for example $P=\mathcal N(0,1)$ and $Q=\delta_0$ on $\mathbb R$, then $Q\not\ll P$ and the displayed chi-square integral is not a finite null second moment. The inequality is one-sided: a large chi-square divergence does not by itself prove that testing is possible, since a sequence of likelihood ratios can be enormous on events of vanishing $P$-probability and inflate $\mathbb E_P[L^2]$ while contributing little to typical samples. Thus small chi-square is a sufficient route to lower bounds, but large chi-square must be supplemented by an explicit test or by a sharper analysis such as truncation.
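The one-sidedness can be seen on a two-point space. In the sketch below, $P_n$ and $Q_n$ are Bernoulli laws with success probabilities $n^{-3}$ and $n^{-1}$; this pair is an illustrative choice, not an example from the theorem. Exact rational arithmetic shows the chi-square divergence growing like $n$ while the total variation tends to $0$.

```python
from fractions import Fraction

def two_point_divergences(p, q):
    """Return (TV, chi-square) for Q = Bernoulli(q) against P = Bernoulli(p), 0 < p < 1."""
    tv = abs(q - p)
    # Likelihood ratio: L(1) = q/p and L(0) = (1-q)/(1-p); chi^2 = E_P[L^2] - 1.
    second_moment = q * q / p + (1 - q) * (1 - q) / (1 - p)
    return tv, second_moment - 1

if __name__ == "__main__":
    for n in (10, 100, 1000):
        tv, chi2 = two_point_divergences(Fraction(1, n**3), Fraction(1, n))
        print(n, float(tv), float(chi2))
```

The likelihood ratio equals $n^2$ on an event of null probability $n^{-3}$, which is exactly the rare-event inflation described above.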
The theorem is most useful when $Q$ is a mixture. In that case the likelihood ratio is itself an average of likelihood ratios, so its second moment can be rewritten as an overlap calculation between two independent planted signals.
[quotetheorem:5949]
[citeproof:5949]
The measurability assumptions are not cosmetic. Without a probability kernel, the formula $\int Q_\theta(A)\,d\pi(\theta)$ may fail to define a measure. A concrete pathology is obtained by taking $\Theta=[0,1]$ with Lebesgue measure, choosing a nonmeasurable set $B\subset[0,1]$, and setting $Q_\theta=\delta_0$ for $\theta\in B$ and $Q_\theta=\delta_1$ for $\theta\notin B$ on the observation space $\{0,1\}$. For the event $A=\{0\}$, the map $\theta\mapsto Q_\theta(A)=\mathbb{1}_B(\theta)$ is not measurable, so $Q_\pi(A)$ is not defined by a [Lebesgue integral](/page/Lebesgue%20Integral). Without a jointly measurable likelihood-ratio version, the product $L_\theta L_{\theta'}$ is not a well-defined random variable on the product space. The identity itself does not prove impossibility unless the resulting second moment is close enough to $1$; it only converts the statistical question into an overlap expectation. The next criterion records the extra asymptotic condition that turns this identity into a testing lower bound.
The identity reduces many impossibility proofs to bounding a single second moment. If this second moment is asymptotically equal to its null value, the mixture cannot be separated from the null in total variation, which gives the following criterion.
[quotetheorem:5950]
[citeproof:5950]
The absolute-continuity hypothesis is part of the criterion, not a technical afterthought, because $L_n=dQ_n/dP_n$ must exist as a random variable under the null. If $Q_n$ assigns positive mass to an event $A_n$ with $P_n(A_n)=0$, then the test $\phi_n=\mathbb{1}_{A_n}$ has zero type I error and positive power on that singular component, a behaviour the null second moment cannot encode. The criterion also requires convergence of the second moment to $1$, not merely boundedness. If $\mathbb E_{P_n}[L_n^2]$ stays bounded away from both $1$ and infinity, the argument gives uniformly integrable likelihood ratios and hence contiguity $Q_n\triangleleft P_n$, but it does not force total variation to vanish. This distinction matters in sharp random matrix problems, where below-threshold alternatives can be contiguous to the null even though the likelihood ratio has a non-degenerate limiting distribution.
This criterion explains why random overlap distributions appear throughout high-dimensional testing. In low-rank matrix models, two independent planted directions have a small inner product, and the second moment calculation asks whether that small overlap is still amplified by the signal strength.
[example: Rank-One Gaussian Spike Below Spectral Scale]
Let $P_d$ be the null law of the GOE matrix $W$, with $W_{ij}\sim\mathcal N(0,1/d)$ for $i<j$ and $W_{ii}\sim\mathcal N(0,2/d)$. For a fixed unit vector $v\in\mathbb R^d$, write $M_v=\lambda vv^\top$, so the spiked law is the law of $W+M_v$. The Gaussian shift formula applied to the independent coordinates $\{Y_{ij}:i\le j\}$ gives
\begin{align*}
L_v(Y)=\prod_{i<j}\exp\left(d(M_v)_{ij}Y_{ij}-\frac d2(M_v)_{ij}^2\right)\prod_i\exp\left(\frac d2(M_v)_{ii}Y_{ii}-\frac d4(M_v)_{ii}^2\right).
\end{align*}
Since $\operatorname{tr}(M_vY)=\sum_i(M_v)_{ii}Y_{ii}+2\sum_{i<j}(M_v)_{ij}Y_{ij}$ and $\operatorname{tr}(M_v^2)=\sum_i(M_v)_{ii}^2+2\sum_{i<j}(M_v)_{ij}^2$, this product is
\begin{align*}
L_v(Y)=\exp\left(\frac d2\operatorname{tr}(M_vY)-\frac d4\operatorname{tr}(M_v^2)\right).
\end{align*}
Now $\operatorname{tr}(M_vY)=\lambda\operatorname{tr}(vv^\top Y)=\lambda v^\top Yv$, while
\begin{align*}
\operatorname{tr}(M_v^2)=\lambda^2\operatorname{tr}(vv^\top vv^\top)=\lambda^2\operatorname{tr}(v(v^\top v)v^\top)=\lambda^2\operatorname{tr}(vv^\top)=\lambda^2.
\end{align*}
Thus
\begin{align*}
L_v(Y)=\exp\left(\frac{d\lambda}{2}v^\top Yv-\frac{d\lambda^2}{4}\right).
\end{align*}
Let $\pi$ be the uniform distribution on the unit sphere, and let $v,v'$ be independent draws from $\pi$. By *[Mixture Second Moment Identity](/theorems/5949)*,
\begin{align*}
\mathbb E_{P_d}[L_\pi(Y)^2]=\mathbb E_{v,v'}\mathbb E_{P_d}[L_v(Y)L_{v'}(Y)].
\end{align*}
For fixed $v,v'$, put $A=M_v+M_{v'}$. The same independent-coordinate Gaussian moment formula gives
\begin{align*}
\mathbb E_{P_d}\exp\left(\frac d2\operatorname{tr}(AY)\right)=\exp\left(\frac d4\operatorname{tr}(A^2)\right).
\end{align*}
Therefore
\begin{align*}
\mathbb E_{P_d}[L_v(Y)L_{v'}(Y)]=\exp\left(\frac d4\operatorname{tr}\left((M_v+M_{v'})^2\right)-\frac d4\operatorname{tr}(M_v^2)-\frac d4\operatorname{tr}(M_{v'}^2)\right).
\end{align*}
The trace term expands as
\begin{align*}
\operatorname{tr}\left((M_v+M_{v'})^2\right)=\operatorname{tr}(M_v^2)+\operatorname{tr}(M_{v'}^2)+2\operatorname{tr}(M_vM_{v'}).
\end{align*}
Also,
\begin{align*}
\operatorname{tr}(M_vM_{v'})=\lambda^2\operatorname{tr}(vv^\top v'v'^\top)=\lambda^2\operatorname{tr}(v(v^\top v')v'^\top)=\lambda^2(v^\top v')^2.
\end{align*}
Since $\operatorname{tr}(M_v^2)=\operatorname{tr}(M_{v'}^2)=\lambda^2$, substitution gives
\begin{align*}
\mathbb E_{P_d}[L_v(Y)L_{v'}(Y)]=\exp\left(\frac d4(2\lambda^2+2\lambda^2\langle v,v'\rangle^2)-\frac{d\lambda^2}{2}\right).
\end{align*}
The constant terms cancel, leaving
\begin{align*}
\mathbb E_{P_d}[L_v(Y)L_{v'}(Y)]=\exp\left(\frac{d\lambda^2}{2}\langle v,v'\rangle^2\right).
\end{align*}
Hence
\begin{align*}
\mathbb E_{P_d}[L_\pi(Y)^2]=\mathbb E_{v,v'}\exp\left(\frac{d\lambda^2}{2}\langle v,v'\rangle^2\right).
\end{align*}
By rotational invariance, $\langle v,v'\rangle^2$ has the $\operatorname{Beta}(1/2,(d-1)/2)$ density
\begin{align*}
f_d(t)=\frac{\Gamma(d/2)}{\Gamma(1/2)\Gamma((d-1)/2)}t^{-1/2}(1-t)^{(d-3)/2},\qquad 0<t<1.
\end{align*}
Thus
\begin{align*}
\mathbb E_{v,v'}\exp\left(\frac{d\lambda^2}{2}\langle v,v'\rangle^2\right)=\frac{\Gamma(d/2)}{\Gamma(1/2)\Gamma((d-1)/2)}\int_0^1 e^{d\lambda^2 t/2}t^{-1/2}(1-t)^{(d-3)/2}\,dt.
\end{align*}
For $0<t<1$, the inequality $\log(1-t)\le -t$ implies $(1-t)^{(d-3)/2}\le e^{-(d-3)t/2}$. Therefore
\begin{align*}
e^{d\lambda^2 t/2}(1-t)^{(d-3)/2}\le \exp\left(-\frac{d(1-\lambda^2)-3}{2}t\right).
\end{align*}
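If $\lambda<1$, then $d(1-\lambda^2)-3>0$ for all large $d$. Using the gamma-ratio bound $\Gamma(d/2)/\Gamma((d-1)/2)\le C\sqrt d$ for large $d$, absorbing the factor $1/\Gamma(1/2)<1$ into the constant, and extending the integral of the nonnegative integrand from $(0,1)$ to $(0,\infty)$, we get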
\begin{align*}
\mathbb E_{v,v'}\exp\left(\frac{d\lambda^2}{2}\langle v,v'\rangle^2\right)\le C\sqrt d\int_0^\infty t^{-1/2}\exp\left(-\frac{d(1-\lambda^2)-3}{2}t\right)\,dt.
\end{align*}
The gamma integral identity $\int_0^\infty t^{-1/2}e^{-\alpha t}\,dt=\Gamma(1/2)\alpha^{-1/2}$ for $\alpha>0$ gives
\begin{align*}
\mathbb E_{v,v'}\exp\left(\frac{d\lambda^2}{2}\langle v,v'\rangle^2\right)\le C\sqrt d\,\Gamma(1/2)\left(\frac{d(1-\lambda^2)-3}{2}\right)^{-1/2}.
\end{align*}
For fixed $\lambda<1$ and all sufficiently large $d$, the final expression is bounded by a constant $C_\lambda<\infty$ independent of $d$. Thus the mixture likelihood ratios have bounded second moments below the spectral scale. This gives a contiguity-style boundedness statement, not total variation convergence to zero, because the second moment is controlled but not shown to converge to $1$; it matches the regime where the top eigenvalue does not separate from the null noise edge.
[/example]
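The overlap moment in the example can also be evaluated numerically. The moment generating function of a $\operatorname{Beta}(1/2,(d-1)/2)$ variable at $z$ is Kummer's series $\sum_{k\ge0}\frac{(1/2)_k}{(d/2)_k}\frac{z^k}{k!}$, and the sketch below sums it at $z=d\lambda^2/2$. The function name `overlap_second_moment` and the stopping rule are illustrative assumptions. Comparing the series term by term with the binomial series of $(1-\lambda^2)^{-1/2}$ suggests that, for $\lambda<1$, the value increases with $d$ towards that limit, which is what the printed numbers should show.

```python
def overlap_second_moment(d, lam, tol=1e-15, max_terms=100_000):
    """E exp((d lam^2 / 2) B) for B ~ Beta(1/2, (d-1)/2), via Kummer's series 1F1(1/2; d/2; z)."""
    a, b, z = 0.5, d / 2.0, d * lam * lam / 2.0
    term, total, k = 1.0, 1.0, 0
    # Successive terms satisfy term_{k+1} = term_k * (a+k)/(b+k) * z/(k+1).
    while term > tol * total and k < max_terms:
        term *= (a + k) / (b + k) * z / (k + 1)
        total += term
        k += 1
    return total

if __name__ == "__main__":
    lam = 0.8
    for d in (10, 100, 1000, 10_000):
        print(d, overlap_second_moment(d, lam), (1 - lam * lam) ** -0.5)
```

For $\lambda>1$ the same series still converges for each fixed $d$, but its value grows rapidly with $d$, consistent with the second moment method failing above the spectral scale.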
## Contiguity and Le Cam Lemmas
Second moments often prove a stronger qualitative statement than a single lower bound: every event that is rare under the null is also rare under the alternative. This is the language of contiguity, and it is useful when studying sharp thresholds and random matrix limits.
[definition: Contiguity]
Let $(P_n)$ and $(Q_n)$ be sequences of probability measures on measurable spaces $(\mathcal X_n,\mathcal A_n)$. The sequence $(Q_n)$ is contiguous with respect to $(P_n)$, written $Q_n\triangleleft P_n$, if for every sequence of events $A_n\in\mathcal A_n$,
\begin{align*}
P_n(A_n)\to0 \implies Q_n(A_n)\to0.
\end{align*}
The sequences are mutually contiguous if $Q_n\triangleleft P_n$ and $P_n\triangleleft Q_n$.
[/definition]
Contiguity rules out consistent tests of $P_n$ against $Q_n$, because a rejection region with vanishing type I error must also have vanishing power under $Q_n$. The difficulty is that absolute continuity at each fixed $n$ does not prevent alternative mass from escaping into events whose null probability tends to zero.
To verify contiguity from likelihood ratios, one needs a condition that rules out this asymptotic loss of mass. The following criterion expresses exactly that requirement as uniform integrability under the null.
[quotetheorem:5951]
[citeproof:5951]
Tightness alone is insufficient because it ignores how much $Q_n$-mass is carried by the likelihood-ratio tail. For example, if $L_n=n$ on an event of $P_n$-probability $1/n$ and $0$ elsewhere, then $(L_n)$ is tight under $P_n$, but $Q_n$ concentrates on an event whose $P_n$-probability tends to $0$. The condition $Q_n\ll P_n$ is also necessary for this likelihood-ratio formulation; if $Q_n$ has singular mass, rare null events can have positive alternative probability regardless of any density on the absolutely continuous part. Uniform integrability is exactly the condition excluding this escape of weighted mass, and the next lemma uses the same no-loss-of-mass idea to transfer limiting distributions.
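The escape of mass in this example can be made explicit with a short computation. Write $A_n$ for the event of $P_n$-probability $1/n$ on which $L_n=n$; the name $A_n$ is introduced only for this sketch. Then for every $M>0$,
\begin{align*}
\mathbb E_{P_n}\left[L_n\mathbb 1_{\{L_n>M\}}\right]=n\,P_n(A_n)\,\mathbb 1_{\{n>M\}}=\mathbb 1_{\{n>M\}},
\qquad\text{so}\qquad
\sup_n\mathbb E_{P_n}\left[L_n\mathbb 1_{\{L_n>M\}}\right]=1.
\end{align*}
The truncated expectations do not vanish as $M\to\infty$, although $P_n(L_n>M)\le 1/n$ for $n>M$ and $P_n(L_n>M)=0$ otherwise, which gives tightness. At the same time $Q_n(A_n)=\mathbb E_{P_n}[L_n\mathbb 1_{A_n}]=1$ while $P_n(A_n)=1/n\to0$, so contiguity fails.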
The first lemma tells us when rare null events remain rare, but it does not describe the limiting law of a statistic under the alternative. This motivates the following theorem, which shows that the alternative limit is obtained by tilting the null limit by the limiting likelihood ratio.
[quotetheorem:5952]
[citeproof:5952]
The condition $\mathbb E[L]=1$ is essential. If the limiting likelihood ratio loses mass, then the limit under $P_n$ has not captured all of the $Q_n$-probability, and the tilted expression is only a subprobability measure; for example, the likelihood ratio $L_n=n$ on an event of $P_n$-probability $1/n$ and $0$ elsewhere converges to $0$ in distribution but has expectation $1$ for every $n$. The common state space $S$ and Borel measurability of $T_n$ are also needed, since weak convergence and continuity sets are defined for laws on a specified measurable [topological space](/page/Topological%20Space). The lemma does not say that all statistics have the same limit under $P_n$ and $Q_n$; it says that their alternative limits are obtained by likelihood-ratio tilting. This is the mechanism used next in random matrix and sparse models, where contiguity can coexist with shifted limiting statistics.
The two lemmas give complementary uses of likelihood ratios: uniform integrability proves indistinguishability of rare events, while joint convergence describes how statistics are shifted under a contiguous alternative. This distinction is important in random matrix models, where global spectral behaviour may remain null-like while refined statistics change.
[remark: Le Cam Lemmas in Detection Problems]
Le Cam's first lemma is used to prove that no asymptotically powerful test exists, while the second lemma describes how statistics change under alternatives that remain contiguous to the null. In spiked random matrix models, this often means that below threshold the empirical spectral distribution has the same limit under both models, but refined linear spectral statistics may acquire a shifted Gaussian limit.
[/remark]
## Sparse Normal Means and the Ingster Boundary
The sparse normal means model is the canonical example where minimax testing has a sharp high-dimensional phase transition. The question is how large the nonzero coordinates must be before a sparse signal can be detected in Gaussian noise.
[definition: Sparse Normal Means Detection Problem]
For each dimension $d$, observe $X=(X_1,\dots,X_d)\in\mathbb R^d$. Under the null,
\begin{align*}
H_0: X\sim\mathcal N(0,I_d).
\end{align*}
Under the alternative,
\begin{align*}
H_1: X\sim\mathcal N(\mu,I_d),\qquad |\operatorname{supp}(\mu)|=s_d,
\end{align*}
with nonzero coordinates constrained by a prescribed amplitude condition.
[/definition]
The detection problem has two competing sources of difficulty: the coordinates are noisy, and the support is unknown. Ingster's boundary identifies the signal amplitude at which these two effects balance in the sparse regime $s_d=d^{1-\beta}$.
[quotetheorem:5953]
[citeproof:5953]
The theorem separates dense-energy detection from sparse-extreme detection. The sparsity assumption $\beta\in(1/2,1)$ is essential for this two-branch boundary: in denser regimes quadratic energy tests have a different scaling, while for nearly finite support the maximum-coordinate analysis dominates almost entirely. The Gaussian noise and fixed amplitude calibration $a_d=\sqrt{2r\log d}$ are also part of the statement; changing the tail behaviour or allowing heterogeneous amplitudes changes the exponent calculation. Below the boundary, the planted sparse distribution is contiguous to the null after the right prior averaging; above it, a test based on threshold exceedances accumulates enough evidence. The phase diagram suggested by the two formulae is therefore not just decorative: it records which testing mechanism controls each part of the sparse regime.
[example: Dense Versus Sparse Detection Heuristics]
For the positive sparse normal means alternative, write $s_d=d^{1-\beta}$ and $a_d=\sqrt{2r\log d}$. Consider the quadratic energy statistic
\begin{align*}
T_d=\sum_{i=1}^d (X_i^2-1).
\end{align*}
Under the null, $X_i\sim\mathcal N(0,1)$ independently, so $\mathbb E_0[X_i^2-1]=0$ and
\begin{align*}
\operatorname{Var}_0(X_i^2-1)=\mathbb E_0[X_i^4]-2\mathbb E_0[X_i^2]+1=3-2+1=2.
\end{align*}
Independence gives $\mathbb E_0[T_d]=0$ and
\begin{align*}
\operatorname{Var}_0(T_d)=\sum_{i=1}^d \operatorname{Var}_0(X_i^2-1)=2d,
\end{align*}
so the null fluctuation scale is $\sqrt{2d}$.
Under an alternative with support $S$ and $\mu_i=a_d\mathbb 1_{\{i\in S\}}$, write $X_i=\mu_i+Z_i$ with independent $Z_i\sim\mathcal N(0,1)$. Then
\begin{align*}
\mathbb E_\mu[X_i^2-1]=\mathbb E[(\mu_i+Z_i)^2-1]=\mu_i^2+2\mu_i\mathbb E[Z_i]+\mathbb E[Z_i^2]-1=\mu_i^2.
\end{align*}
Summing over coordinates gives
\begin{align*}
\mathbb E_\mu[T_d]=\sum_{i=1}^d\mu_i^2=s_d a_d^2=d^{1-\beta}\cdot 2r\log d.
\end{align*}
Thus the energy statistic is naturally powerful when its alternative mean shift is large compared with its null standard deviation, that is when
\begin{align*}
d^{1-\beta}\cdot 2r\log d\gg \sqrt{2d}.
\end{align*}
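Since the left side is of order $d^{1-\beta}\log d$ and the right side is of order $d^{1/2}$, the condition holds for $\beta\le 1/2$ and fails for every fixed $\beta>1/2$ at this amplitude calibration. It is therefore most favorable at the dense end of the range, near $\beta=1/2$, because then the number $s_d=d^{1-\beta}$ of planted coordinates is still large enough for the individual squared signals to accumulate.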
When $\beta$ is close to $1$, the support size $s_d=d^{1-\beta}$ is much smaller, so tail counts are better aligned with the signal. For a threshold $t_d=\sqrt{2q\log d}$, the null expected number of positive exceedances is
\begin{align*}
d\,\mathbb P(Z>t_d)=d\,\mathbb P(Z>\sqrt{2q\log d}).
\end{align*}
On a signal coordinate, the exceedance probability is
\begin{align*}
\mathbb P(a_d+Z>t_d)=\mathbb P(Z>\sqrt{2q\log d}-\sqrt{2r\log d}).
\end{align*}
The sparse-side comparison is therefore between many null tails at height $\sqrt{2q\log d}$ and the shifted tails on only $d^{1-\beta}$ planted coordinates. This is why the denser part of the boundary is governed by accumulated quadratic energy, while the very sparse part is governed by rare Gaussian exceedances and higher-criticism-type threshold counts.
[/example]
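The two heuristics in the example can be compared numerically. The sketch below computes the ratio of the energy statistic's mean shift to its null standard deviation, and the expected exceedance counts at threshold $\sqrt{2q\log d}$. The function names and parameter values are illustrative assumptions; the Gaussian tail is computed with the standard-library `erfc`.

```python
from math import erfc, log, sqrt

def normal_tail(x):
    """P(Z > x) for Z ~ N(0, 1)."""
    return 0.5 * erfc(x / sqrt(2.0))

def energy_signal_to_noise(d, beta, r):
    """Mean shift s_d * a_d^2 of T_d divided by its null standard deviation sqrt(2d)."""
    s = d ** (1.0 - beta)
    return s * 2.0 * r * log(d) / sqrt(2.0 * d)

def exceedance_counts(d, beta, r, q):
    """Expected exceedances of sqrt(2 q log d): (all d coordinates under H0, planted coordinates only)."""
    s = d ** (1.0 - beta)
    t = sqrt(2.0 * q * log(d))
    a = sqrt(2.0 * r * log(d))
    return d * normal_tail(t), s * normal_tail(t - a)

if __name__ == "__main__":
    d = 10**6
    for beta in (0.3, 0.5, 0.8):
        print(beta, energy_signal_to_noise(d, beta, r=0.5))
    print(exceedance_counts(d, beta=0.8, r=0.9, q=1.0))
```

With these values, the energy ratio is large for $\beta=0.3$ and small for $\beta=0.8$, while at $\beta=0.8$ the planted coordinates contribute several expected exceedances against fewer than one under the null.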
Sparse normal means shows that detection can be controlled either by accumulated energy or by rare tail counts. The same minimax language also describes low-rank planted matrix models, where the unknown object is a direction rather than a support.
[example: Spectral Spike as a Detection Analogue]
In the rank-one spiked Wigner model
\begin{align*}
Y=\lambda vv^\top+W,
\end{align*}
the null law is $P_d=\mathcal L(W)$, while the alternative is composite because the unit vector $v\in\mathbb R^d$ is unknown. For $\lambda>1$, the rank-one Wigner spike transition gives $\lambda_{\max}(W)\to 2$ under the null and $\lambda_{\max}(\lambda vv^\top+W)\to \lambda+1/\lambda$ under the fixed-$v$ alternative. The limiting gap is positive, since
\begin{align*}
\lambda+\frac1\lambda-2=\frac{\lambda^2-2\lambda+1}{\lambda}=\frac{(\lambda-1)^2}{\lambda}>0.
\end{align*}
Thus any fixed threshold $\tau$ with $2<\tau<\lambda+1/\lambda$ defines the spectral test $\phi(Y)=\mathbb 1_{\{\lambda_{\max}(Y)>\tau\}}$, and the null rejection probability and the fixed-$v$ missed-detection probability both tend to $0$.
For $\lambda<1$, average the alternative over a uniform random direction $v$ and write $L_\pi=dQ_{\pi,d}/dP_d$. The second-moment computation for the rank-one spike gives
\begin{align*}
\mathbb E_{P_d}[L_\pi(Y)^2]=\mathbb E_{v,v'}\exp\left(\frac{d\lambda^2}{2}\langle v,v'\rangle^2\right),
\end{align*}
by *Mixture Second Moment Identity*. Since $\langle v,v'\rangle^2$ has the $\operatorname{Beta}(1/2,(d-1)/2)$ law for independent uniform directions, the preceding overlap estimate gives, for each fixed $\lambda<1$, a constant $C_\lambda<\infty$ such that
\begin{align*}
\sup_d \mathbb E_{P_d}[L_\pi(Y)^2]\le C_\lambda.
\end{align*}
This $L^2$ bound implies uniform integrability. Indeed, for $M>0$, Cauchy--Schwarz gives
\begin{align*}
\mathbb E_{P_d}[L_\pi\mathbb 1_{\{L_\pi>M\}}]\le \left(\mathbb E_{P_d}[L_\pi^2]\right)^{1/2}\left(P_d(L_\pi>M)\right)^{1/2}.
\end{align*}
Markov's inequality and $\mathbb E_{P_d}[L_\pi]=1$ give $P_d(L_\pi>M)\le M^{-1}$, so
\begin{align*}
\mathbb E_{P_d}[L_\pi\mathbb 1_{\{L_\pi>M\}}]\le C_\lambda^{1/2}M^{-1/2}.
\end{align*}
Letting $M\to\infty$ proves uniform integrability, and *Le Cam First Lemma* yields $Q_{\pi,d}\triangleleft P_d$. Therefore every event whose null probability vanishes also has vanishing probability under this mixed planted model.
The analogy with sparse normal means is that the unknown support there and the unknown direction here both create a large composite alternative: detection succeeds only when the signal strength overcomes the size and geometry of the planted class.
[/example]
## What the Chapter Adds to the Minimax Toolkit
The methods of this chapter refine the lower-bound strategy from earlier parts of the course. Le Cam's total variation identity gives the exact simple-testing benchmark; mixture priors turn composite alternatives into tractable simple alternatives; chi-square divergence and second moments make the calculation feasible; contiguity records the resulting indistinguishability at the level of all rare events.
These tools are especially suited to high-dimensional threshold phenomena. They explain why sparse detection, planted submatrix detection, stochastic block models, and spiked random matrices often have sharp boundaries: below threshold the mixture likelihood ratio behaves like a null random variable of mean one, while above threshold a statistic aligned with the planted structure escapes the null fluctuation scale.
Testing and contiguity have shown that some high-dimensional signals are statistically undetectable below sharp thresholds. The next chapter broadens the perspective to modern phenomena where these information-theoretic limits interact with computational constraints and random matrix phase transitions.
# 11. Connections to Modern High-Dimensional Phenomena
Modern high-dimensional statistics is not only about the minimax risk achievable by arbitrary estimators. Many of the estimators that attain information-theoretic rates are computationally infeasible, while the algorithms used in practice often succeed or fail at thresholds predicted by random matrix spectra and iterative dynamics. This chapter connects the minimax and random-matrix tools developed earlier in the course to three modern phenomena: statistical-computational gaps, approximate message passing, and interpolation in overparameterised regression.
The guiding theme is that eigenvalues and overlaps are not merely technical quantities. They encode when a hidden signal can be detected, when an iterative algorithm has a stable informative fixed point, and when an interpolating estimator generalises despite fitting the training data exactly.
## Statistical-Computational Gaps in Sparse PCA
When does the existence of a good estimator fail to translate into a polynomial-time method? Sparse PCA is the standard testing ground because the information-theoretic threshold, the spectral threshold, and the conjectural computational threshold separate in a visible way.
[definition: Sparse Spiked Covariance Model]
Let $d,n \in \mathbb N$, let $v \in \mathbb R^d$ satisfy $\|v\|_2=1$ and $|\operatorname{supp}(v)| \le k$, and let $\theta > 0$. In the sparse spiked covariance model, the observations $X_1,\dots,X_n \in \mathbb R^d$ are i.i.d. with
\begin{align*}
X_i \sim \mathcal N(0, I_d + \theta vv^\top).
\end{align*}
[/definition]
The model separates dimension, sample size, sparsity, and spike strength, so it lets us ask whether the signal exists before asking how to estimate it. This motivates the associated testing formulation, where the statistician must distinguish pure noise from the presence of some unknown sparse direction.
[definition: Sparse PCA Detection Problem]
For fixed $d,n,k,\theta$, the sparse PCA detection problem is the hypothesis test between the null model $H_0$, where $X_1,\dots,X_n$ are i.i.d. $\mathcal N(0,I_d)$, and the alternative model $H_1$, where $X_1,\dots,X_n$ are i.i.d. $\mathcal N(0,I_d+\theta vv^\top)$ for some $v\in \mathbb R^d$ with $\|v\|_2=1$ and $|\operatorname{supp}(v)|\le k$.
[/definition]
Detection is weaker than estimation, so lower bounds for detection also constrain support recovery and vector estimation. This motivates first locating the threshold for unrestricted computation, since it is the benchmark against which algorithmic restrictions will be compared.
[quotetheorem:5954]
This theorem is deliberately about a prior-mixture alternative and an average testing error, not worst-case recovery of every sparse vector. The uniform-sign prior makes the second-moment argument symmetric enough to compute, while the assumptions $k\to\infty$ and $k/d\to0$ keep the overlap of two independent random supports in the sparse regime where the support entropy is comparable to $k\log(d/k)$. At the edges of this regime, when $k$ is fixed or nearly as large as $d$, the same displayed scale no longer captures the correct combinatorics without modification. The average-error formulation is also essential: a second-moment lower bound controls the mixture over spikes, while a worst-case statement would need additional arguments ruling out tests tuned to exceptional supports. On the upper-bound side, a search over all supports can aggregate the leading empirical eigenvalue on every $k$-coordinate principal submatrix, and the entropy term $\log \binom{d}{k}$ is the price of not knowing the support.
[example: Exhaustive Sparse PCA Scan]
Let $S_n=n^{-1}\sum_{i=1}^n X_iX_i^\top$, and scan all $k$-coordinate principal submatrices by
\begin{align*}
T_k=\max_{S\subset\{1,\dots,d\},\ |S|=k}\lambda_{\max}((S_n)_{S,S}).
\end{align*}
For a fixed support $S$ under $H_0$, the restricted vectors $(X_i)_S$ are i.i.d. $\mathcal N(0,I_k)$, so $(S_n)_{S,S}$ is a $k$-dimensional sample covariance. The fixed-support covariance concentration bound gives, for an absolute constant $C$ and any $u>0$,
\begin{align*}
\lambda_{\max}((S_n)_{S,S})\le 1+C\left(\sqrt{\frac{k+u}{n}}+\frac{k+u}{n}\right)
\end{align*}
with probability at least $1-e^{-u}$. Applying this bound to each of the $\binom{d}{k}$ supports and taking $u=\log\binom{d}{k}+t$ yields
\begin{align*}
\mathbb P_{H_0}\left(T_k>1+C\left(\sqrt{\frac{k+\log\binom{d}{k}+t}{n}}+\frac{k+\log\binom{d}{k}+t}{n}\right)\right)\le e^{-t}.
\end{align*}
Since
\begin{align*}
\log\binom{d}{k}\le k\log\left(\frac{ed}{k}\right),
\end{align*}
the leading null fluctuation is of order $\sqrt{k\log(d/k)/n}$ in the regime where the linear term is smaller.
Under $H_1$, let $S_\star=\operatorname{supp}(v)$. If $|S_\star|<k$, enlarge it to a $k$-element set; the Rayleigh quotient in the direction $v$ is unchanged because $v$ is zero outside $S_\star$. Therefore
\begin{align*}
T_k\ge \lambda_{\max}((S_n)_{S_\star,S_\star})\ge v^\top S_n v=\frac1n\sum_{i=1}^n (v^\top X_i)^2.
\end{align*}
Because $X_i\sim\mathcal N(0,I_d+\theta vv^\top)$ and $\|v\|_2=1$,
\begin{align*}
\operatorname{Var}(v^\top X_i)=v^\top(I_d+\theta vv^\top)v=v^\top v+\theta(v^\top v)^2=1+\theta.
\end{align*}
Thus $(v^\top X_i)^2$ has the same distribution as $(1+\theta)Z_i^2$ with $Z_i\sim\mathcal N(0,1)$, so the true support contributes an empirical eigenvalue concentrated around $1+\theta$. A threshold just above the null level therefore detects the spike when
\begin{align*}
\theta\gg \sqrt{\frac{k\log(d/k)}{n}}.
\end{align*}
The statistical price is the support entropy in the threshold, while the computational price is that evaluating $T_k$ requires inspecting $\binom{d}{k}$ supports, which is exponential in $k$ in sparse high-dimensional regimes.
[/example]
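A direct implementation makes the computational price of the scan visible. The sketch below evaluates $T_k$ for a small covariance matrix given as a list of rows, using plain power iteration for the leading eigenvalue. The helper names `lambda_max` and `scan_statistic`, the iteration count, and the all-ones starting vector are assumptions made for illustration; power iteration can stall if the start is orthogonal to the leading eigenvector, which is not the generic case for sample covariances.

```python
from itertools import combinations
from math import comb, sqrt

def lambda_max(M, iters=500):
    """Leading eigenvalue of a symmetric positive semidefinite matrix by power iteration."""
    k = len(M)
    v = [1.0 / sqrt(k)] * k
    for _ in range(iters):
        w = [sum(M[i][j] * v[j] for j in range(k)) for i in range(k)]
        norm = sqrt(sum(x * x for x in w))
        if norm == 0.0:
            return 0.0
        v = [x / norm for x in w]
    # Rayleigh quotient of the final unit vector.
    return sum(v[i] * M[i][j] * v[j] for i in range(k) for j in range(k))

def scan_statistic(S, k):
    """T_k = max over supports T of size k of lambda_max(S[T, T]); visits comb(d, k) supports."""
    d = len(S)
    best = 0.0
    for T in combinations(range(d), k):
        sub = [[S[i][j] for j in T] for i in T]
        best = max(best, lambda_max(sub))
    return best

if __name__ == "__main__":
    print(scan_statistic([[1.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 3.0]], 2))
    print("supports for d=1000, k=20:", comb(1000, 20))
```

The loop is feasible only for very small $d$ and $k$; the printed support count for $d=1000$, $k=20$ indicates why the scan is treated as a statistical benchmark rather than a practical algorithm.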
The preceding example explains why information-theoretic rates need not reflect practical algorithms. Polynomial-time procedures such as ordinary spectral methods and semidefinite relaxations see a different threshold, and planted clique is used as a conjectural barrier to improving it in broad regimes.
[definition: Planted Clique Hypothesis]
The planted clique hypothesis asserts that there is no randomized polynomial-time algorithm that, given a graph on $N$ vertices sampled either from $G(N,1/2)$ or from $G(N,1/2)$ with an added clique of size $\kappa_N$, distinguishes the two models with error probability tending to $0$ when $\kappa_N = o(\sqrt{N})$.
[/definition]
This is a computational conjecture rather than a theorem about sparse PCA itself. Its force comes from reductions, so it motivates a conditional transfer principle: if a randomized polynomial-time reduction maps the planted-clique null close in total variation to the sparse-PCA null, maps the planted-clique alternative close in total variation to sparse-PCA alternatives of sparsity at most $k_N$ and spike at least $\theta_N$, and runs in polynomial time, then a polynomial-time sparse-PCA test with vanishing total error would contradict the planted clique hypothesis.
Each condition is necessary for the conclusion to say anything about sparse PCA. If null closeness fails, the output under $G(N,1/2)$ might contain a detectable non-Gaussian artefact, such as a visibly biased diagonal or entry distribution, and a sparse PCA test could reject that artefact without solving planted clique. If alternative closeness fails, the planted-clique instance might map to a covariance spike with support larger than $k_N$ or strength below $\theta_N$, so success on the stated sparse PCA alternative would not imply success on the reduction output. If the map is not polynomial time, the reduction could hide an exhaustive search for the clique before producing the sparse PCA sample, which would not transfer a polynomial-time sparse PCA algorithm back to a polynomial-time planted clique algorithm. The conclusion is conditional: it does not prove that existing sparse PCA algorithms fail, and it does not rule out a future polynomial-time algorithm unless the planted clique hypothesis and the reduction hypotheses are both accepted. In standard reductions these hypotheses produce the heuristic computational scale
\begin{align*}
\theta\asymp \sqrt{\frac{k^2}{n}},
\end{align*}
so the result is informative only in parameter ranges where those reductions control the total variation errors.
[example: Sparse PCA Gap Regime]
Take a sparse regime in which $k=k_d\to\infty$ and $\log(d/k)=o(k)$. The two relevant scales are
\begin{align*}
a_d=\sqrt{\frac{k\log(d/k)}{n}}
\qquad\text{and}\qquad
b_d=\sqrt{\frac{k^2}{n}}=\frac{k}{\sqrt n}.
\end{align*}
Their ratio is
\begin{align*}
\frac{a_d}{b_d}
=
\frac{\sqrt{k\log(d/k)/n}}{\sqrt{k^2/n}}
=
\sqrt{\frac{k\log(d/k)}{n}\cdot\frac{n}{k^2}}
=
\sqrt{\frac{\log(d/k)}{k}}
\longrightarrow 0,
\end{align*}
so the information-theoretic scale is asymptotically smaller than the planted-clique computational scale.
For a concrete spike strength in the gap, set
\begin{align*}
\theta_d=\sqrt{a_db_d}
=
\left(\sqrt{\frac{k\log(d/k)}{n}}\sqrt{\frac{k^2}{n}}\right)^{1/2}
=
\frac{(k^3\log(d/k))^{1/4}}{\sqrt n}.
\end{align*}
Then
\begin{align*}
\frac{a_d}{\theta_d}
=
\frac{a_d}{\sqrt{a_db_d}}
=
\sqrt{\frac{a_d}{b_d}}
=
\left(\frac{\log(d/k)}{k}\right)^{1/4}
\longrightarrow 0,
\end{align*}
and
\begin{align*}
\frac{\theta_d}{b_d}
=
\frac{\sqrt{a_db_d}}{b_d}
=
\sqrt{\frac{a_d}{b_d}}
=
\left(\frac{\log(d/k)}{k}\right)^{1/4}
\longrightarrow 0.
\end{align*}
Thus
\begin{align*}
\sqrt{\frac{k\log(d/k)}{n}} \ll \theta_d \ll \sqrt{\frac{k^2}{n}}.
\end{align*}
By *[Information-Theoretic Sparse PCA Detection Scale](/theorems/5954)*, exhaustive sparse eigenvalue scanning succeeds once $\theta$ is above the left-hand scale. The planted-clique transfer principle above predicts no polynomial-time test below the right-hand scale, conditional on the reduction hypotheses and the planted clique hypothesis. The same parameter sequence is therefore statistically detectable by unrestricted search while remaining outside the predicted reach of polynomial-time methods.
[/example]
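The three scales in this example are easy to tabulate for concrete parameters. The sketch below returns $a_d$, the geometric-mean spike $\theta_d=\sqrt{a_db_d}$, and $b_d$; the function name `sparse_pca_scales` and the sample values of $(d,k,n)$ are illustrative assumptions, and constants are ignored as in the text.

```python
from math import log, sqrt

def sparse_pca_scales(d, k, n):
    """Return (a_d, theta_d, b_d): statistical scale, a spike in the gap, computational scale."""
    a = sqrt(k * log(d / k) / n)   # information-theoretic scale
    b = k / sqrt(n)                # planted-clique computational scale
    theta = sqrt(a * b)            # geometric mean, strictly between a and b when a < b
    return a, theta, b

if __name__ == "__main__":
    a, theta, b = sparse_pca_scales(d=10**6, k=1000, n=10**5)
    print(f"a_d={a:.4f}  theta_d={theta:.4f}  b_d={b:.4f}  a_d/b_d={a / b:.4f}")
```

Because $\theta_d$ is the geometric mean, the two ratios $a_d/\theta_d$ and $\theta_d/b_d$ coincide, matching the two displayed limits.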
## Approximate Message Passing and State Evolution
How can an iterative algorithm have a deterministic asymptotic theory in a random high-dimensional model? Approximate message passing answers this by adding a correction term to naive iteration, making the effective noise at each step behave like a Gaussian variable whose variance follows a scalar recursion.
[definition: Approximate Message Passing Iteration]
Let $A \in \mathbb R^{n\times d}$ have independent entries with variance $1/n$, and let $y \in \mathbb R^n$ be observations generated from a high-dimensional model. An approximate message passing iteration is a recursion of the form
\begin{align*}
x_{t+1} = \eta_t(A^\top r_t+x_t),
\end{align*}
with residual update
\begin{align*}
r_t = y-Ax_t+b_t r_{t-1},
\end{align*}
where $\eta_t:\mathbb R^d\to \mathbb R^d$ is usually coordinatewise and $b_t$ is the Onsager coefficient determined by the average derivative of $\eta_{t-1}$.
[/definition]
The Onsager term is the feature that separates AMP from ordinary projected gradient or iterative thresholding. It cancels the leading self-interaction created by reusing the same random matrix $A$ at every iteration, which motivates tracking the remaining effective noise through a deterministic scalar recursion.
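The recursion can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not a reference implementation: the denoiser is coordinatewise soft thresholding, the threshold is a hypothetical multiple `alpha` of the empirical residual scale $|r_t|/\sqrt n$, and the Onsager coefficient is taken to be $b_{t+1}=\frac1n\sum_i \eta_t'(u_i)$, which for soft thresholding is the number of nonzero coordinates of $x_{t+1}$ divided by $n$. The problem sizes and the $\pm1$ signal are also illustrative choices.

```python
import numpy as np

def soft_threshold(u, lam):
    """Coordinatewise soft thresholding sgn(u) * (|u| - lam)_+."""
    return np.sign(u) * np.maximum(np.abs(u) - lam, 0.0)

def amp_soft_threshold(A, y, alpha=1.5, n_iter=25):
    """AMP sketch with soft thresholding; alpha is a hypothetical tuning constant."""
    n, d = A.shape
    x = np.zeros(d)      # x_0 = 0
    r = y.copy()         # r_0 = y - A x_0, with no Onsager term at t = 0
    for _ in range(n_iter):
        tau_hat = np.linalg.norm(r) / np.sqrt(n)   # empirical effective-noise level
        x = soft_threshold(A.T @ r + x, alpha * tau_hat)
        # Average derivative of the denoiser, rescaled by d/n: (#nonzeros) / n.
        b = np.count_nonzero(x) / n
        r = y - A @ x + b * r                      # residual with Onsager correction
    return x

# Synthetic sparse regression instance (illustrative sizes).
rng = np.random.default_rng(0)
n, d, k, sigma = 400, 800, 40, 0.05
A = rng.standard_normal((n, d)) / np.sqrt(n)       # entries with variance 1/n
theta = np.zeros(d)
support = rng.choice(d, size=k, replace=False)
theta[support] = rng.choice(np.array([-1.0, 1.0]), size=k)
y = A @ theta + sigma * rng.standard_normal(n)

x_hat = amp_soft_threshold(A, y)
rel_err = np.sum((x_hat - theta) ** 2) / np.sum(theta ** 2)
print("relative squared error:", rel_err)
```

Dropping the term `b * r` turns the loop into plain iterative soft thresholding, which is the comparison the Onsager discussion above has in mind.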
[definition: State Evolution]
Fix an aspect ratio $\delta\in(0,\infty)$, a noise variance $\sigma^2\ge 0$, a prior random variable $\Theta\in L^2$, and a sequence of scalar denoisers $\eta_t:\mathbb R\to\mathbb R$. The state-evolution transformation is the map
\begin{align*}
\mathcal T_{\delta,\sigma,\Theta,(\eta_t)}:[0,\infty)^{\mathbb N}\to[0,\infty)^{\mathbb N}
\end{align*}
defined by $\mathcal T_{\delta,\sigma,\Theta,(\eta_t)}((\tau_t^2)_{t\ge 0})=(\tilde\tau_t^2)_{t\ge 0}$, where $\tilde\tau_0^2=\tau_0^2$ and
\begin{align*}
\tilde\tau_{t+1}^2 = \sigma^2 + \frac{1}{\delta}\,\mathbb E\left[(\eta_t(\Theta + \tau_t Z)-\Theta)^2\right],
\end{align*}
with $Z\sim \mathcal N(0,1)$ independent of $\Theta$.
[/definition]
In applications, the state-evolution sequence is obtained by fixing an initial value $\tau_0^2$ and iterating this transformation. It predicts the empirical distribution of the AMP effective observation $A^\top r_t+x_t$ in the limit $n,d\to\infty$ with $n/d\to \delta$.
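A minimal sketch of this iteration, assuming a single time-independent denoiser and a Monte Carlo approximation of the expectation, is given below. The helper name `state_evolution`, the three-point prior, and the linear shrinkage denoiser $\eta(u)=u/2$ are illustrative assumptions; the linear denoiser is chosen only because its fixed point is available in closed form, which makes the recursion easy to check.

```python
import numpy as np

def state_evolution(eta, theta_samples, tau0_sq, sigma_sq, delta, n_steps, rng):
    """Iterate tau_{t+1}^2 = sigma^2 + E[(eta(Theta + tau_t Z) - Theta)^2] / delta,
    approximating the expectation by an average over samples of Theta."""
    tau_sq = [tau0_sq]
    for _ in range(n_steps):
        z = rng.standard_normal(theta_samples.shape)
        u = theta_samples + np.sqrt(tau_sq[-1]) * z     # scalar Gaussian channel
        mse = np.mean((eta(u) - theta_samples) ** 2)
        tau_sq.append(sigma_sq + mse / delta)
    return np.array(tau_sq)

rng = np.random.default_rng(0)
# Hypothetical sparse prior: Theta = +-1 with probability 0.1 each, else 0.
theta_samples = rng.choice(np.array([-1.0, 0.0, 1.0]), size=200_000, p=[0.1, 0.8, 0.1])
delta, sigma_sq = 0.5, 0.1

def shrink(u):
    """Linear shrinkage denoiser eta(u) = u / 2 (illustrative only)."""
    return 0.5 * u

tau_sq = state_evolution(shrink, theta_samples, 1.0, sigma_sq, delta, 30, rng)

# For eta(u) = c u the map is tau^2 -> sigma^2 + ((1-c)^2 E[Theta^2] + c^2 tau^2) / delta,
# so with c = 1/2 the fixed point solves a linear equation.
m2 = np.mean(theta_samples ** 2)
fixed_point = (sigma_sq + 0.25 * m2 / delta) / (1.0 - 0.25 / delta)
print(tau_sq[-1], fixed_point)
```

For a nonlinear denoiser the same loop applies unchanged; only the closed-form comparison is lost.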
State evolution turns a high-dimensional random algorithm into a one-dimensional dynamical system, but this reduction is not a formal consequence of the [central limit theorem](/theorems/521) alone. The AMP iterates are strongly dependent because the same matrix $A$ is reused at every step, and the Onsager correction is designed precisely to remove the leading dependence that would otherwise accumulate. The next theorem is the rigorous justification for treating the effective AMP coordinate at each fixed time as a scalar Gaussian channel with variance predicted by the recursion.
[quotetheorem:5956]
The hypotheses explain the theorem's narrow but powerful scope. Gaussianity keeps the effective observation close to the scalar Gaussian channel predicted by the recursion; without it, a design with heavy-tailed entries can produce a few oversized columns that dominate $A^\top r_t$ and break that approximation. Separability and Lipschitz regularity keep the denoising step stable under empirical-law convergence; without Lipschitz control, a hard discontinuity such as exact thresholding at a point with positive limiting mass can turn a vanishing perturbation in the effective observation into a non-vanishing change in the empirical law. Bounded second moments prevent a small number of large coordinates from dominating the recursion; for example, if one signal coordinate has size comparable to $\sqrt d$, the empirical mean squared error is no longer represented by a typical coordinate $\Theta$. The fixed-$t$ condition avoids error accumulation over a growing number of iterations; when $t=t_d$ grows with dimension, small approximation errors can accumulate until the state evolution prediction no longer controls the realised iterate. If these assumptions are dropped, AMP can still work, but state evolution may require universality arguments, stronger regularity, damping, or a different Onsager correction.
[example: Soft-Thresholding AMP for Sparse Regression]
Suppose $y=A\theta+w$, with $A_{ij}\sim \mathcal N(0,1/n)$, $w_i\sim \mathcal N(0,\sigma^2)$, and a sparse signal $\theta$ whose empirical coordinate distribution converges to the law of $\Theta$. For the coordinatewise soft-thresholding denoiser
\begin{align*}
\eta_t(u)=\operatorname{sgn}(u)(|u|-\lambda_t)_+,
\end{align*}
the formula means $\eta_t(u)=u-\lambda_t$ when $u>\lambda_t$, $\eta_t(u)=0$ when $|u|\le \lambda_t$, and $\eta_t(u)=u+\lambda_t$ when $u<-\lambda_t$.
By *[AMP State Evolution Theorem](/theorems/5956)*, for each fixed iteration $t$, the empirical joint law of a signal coordinate, its effective AMP observation, and the next estimate is predicted by
\begin{align*}
(\Theta,\Theta+\tau_t Z,\eta_t(\Theta+\tau_t Z)),
\end{align*}
where $Z\sim\mathcal N(0,1)$ is independent of $\Theta$. Therefore the empirical mean squared error of the AMP estimate is predicted by
\begin{align*}
\mathbb E\left[(\eta_t(\Theta+\tau_t Z)-\Theta)^2\right].
\end{align*}
Writing $U=\Theta+\tau_t Z$, the three threshold regions give the squared error explicitly. On the event $U>\lambda_t$,
\begin{align*}
(\eta_t(U)-\Theta)^2=(U-\lambda_t-\Theta)^2=(\tau_t Z-\lambda_t)^2.
\end{align*}
On the event $|U|\le \lambda_t$,
\begin{align*}
(\eta_t(U)-\Theta)^2=(0-\Theta)^2=\Theta^2.
\end{align*}
On the event $U<-\lambda_t$,
\begin{align*}
(\eta_t(U)-\Theta)^2=(U+\lambda_t-\Theta)^2=(\tau_t Z+\lambda_t)^2.
\end{align*}
Thus choosing $\lambda_t$ reduces to minimizing this scalar Gaussian-channel expectation, rather than rerunning or simulating the full $d$-dimensional AMP recursion for each candidate threshold.
[/example]
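The threshold choice described in the example can be sketched as a one-dimensional grid search over the scalar Gaussian-channel error. The prior, the noise level `tau`, and the grid are illustrative assumptions, and the expectation is approximated by Monte Carlo rather than computed in closed form.

```python
import numpy as np

def soft_threshold(u, lam):
    """Coordinatewise soft thresholding sgn(u) * (|u| - lam)_+."""
    return np.sign(u) * np.maximum(np.abs(u) - lam, 0.0)

def scalar_mse(lam, tau, theta, z):
    """Monte Carlo estimate of E[(eta(Theta + tau Z) - Theta)^2] for threshold lam."""
    return np.mean((soft_threshold(theta + tau * z, lam) - theta) ** 2)

rng = np.random.default_rng(0)
N = 200_000
# Hypothetical sparse prior: Theta = +-1 with probability 0.05 each, else 0.
theta = rng.choice(np.array([-1.0, 0.0, 1.0]), size=N, p=[0.05, 0.90, 0.05])
z = rng.standard_normal(N)
tau = 0.3

grid = np.linspace(0.0, 1.5, 31)               # candidate thresholds
mses = np.array([scalar_mse(lam, tau, theta, z) for lam in grid])
best_lam = grid[np.argmin(mses)]

print("best threshold on grid:", best_lam)
print("MSE without thresholding:", mses[0], " best MSE:", mses.min())
```

At `lam = 0` the denoiser is the identity, so the error is approximately $\tau^2$; for a sparse prior a strictly positive threshold reduces it because most coordinates are pure noise.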
AMP also gives an algorithmic interpretation of phase transitions. If the state evolution recursion has an informative stable fixed point that is reachable from the initialisation, AMP succeeds; if the informative fixed point exists but is separated from the initialisation by an unstable barrier, AMP can stall at an uninformative fixed point and the Bayes-optimal estimator may outperform efficient iterations.
## Double Descent and Benign Overfitting
Why can a regression rule that interpolates noisy data still have small prediction risk? The modern answer is spectral: overparameterisation creates many directions, and the risk depends on how the signal and noise align with the eigenvalues of the empirical covariance.
[definition: Ridgeless Least Squares]
Let
\begin{align*}
\mathcal D_{n,d}=\{(X,Y)\in \mathbb R^{n\times d}\times\mathbb R^n:XX^\top \text{ is invertible}\}.
\end{align*}
The ridgeless least squares estimator is the map
\begin{align*}
\hat\beta_{\mathrm{ridgeless}}:\mathcal D_{n,d}\to\mathbb R^d
\end{align*}
defined by
\begin{align*}
\hat\beta_{\mathrm{ridgeless}}(X,Y)=X^\top(XX^\top)^{-1}Y.
\end{align*}
[/definition]
For regression interpretation, $X\in \mathbb R^{n\times d}$ is the design matrix with rows $X_i^\top$, $Y\in \mathbb R^n$, and the data are often written in the linear model form
\begin{align*}
Y = X\beta + \varepsilon,
\end{align*}
where $\beta \in \mathbb R^d$ and $\varepsilon \in \mathbb R^n$. The formula above is the minimum Euclidean norm interpolator whenever $XX^\top$ is invertible.
In the overparameterised regime $d>n$, the equation $Xb=Y$ has infinitely many solutions. The minimum-norm rule selects the solution in the row span of $X$, so we need a test-distribution quantity that measures whether this selected interpolator predicts well.
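The minimum-norm property can be checked numerically. The sketch below, with illustrative dimensions and a hypothetical helper `ridgeless`, computes $X^\top(XX^\top)^{-1}Y$, compares it with the Moore-Penrose pseudoinverse solution, and builds a second interpolator by adding a vector from the null space of $X$.

```python
import numpy as np

def ridgeless(X, Y):
    """Minimum Euclidean norm interpolator X^T (X X^T)^{-1} Y (needs X X^T invertible)."""
    return X.T @ np.linalg.solve(X @ X.T, Y)

rng = np.random.default_rng(0)
n, d = 20, 50                                   # overparameterised: d > n
X = rng.standard_normal((n, d))
Y = rng.standard_normal(n)

beta_hat = ridgeless(X, Y)
beta_pinv = np.linalg.pinv(X) @ Y               # pseudoinverse solution, for comparison

# Another solution of X b = Y: add the component of a random vector
# that is orthogonal to the row span of X.
v = rng.standard_normal(d)
null_part = v - X.T @ np.linalg.solve(X @ X.T, X @ v)
other = beta_hat + null_part

print("interpolates:", np.allclose(X @ beta_hat, Y), np.allclose(X @ other, Y))
print("norms:", np.linalg.norm(beta_hat), np.linalg.norm(other))
```

Because `beta_hat` lies in the row span and `null_part` is orthogonal to it, the squared norms add, so every other interpolator is strictly longer.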
[definition: Prediction Risk]
Let $x_{\mathrm{new}}\in \mathbb R^d$ be an independent test feature with covariance $\Sigma=\mathbb E[x_{\mathrm{new}}x_{\mathrm{new}}^\top]$, and fix the target parameter $\beta\in\mathbb R^d$. The excess prediction risk is the functional
\begin{align*}
R_\beta:\mathbb R^d\to [0,\infty).
\end{align*}
It is defined by
\begin{align*}
R_\beta(\tilde\beta)=\mathbb E\left[(x_{\mathrm{new}}^\top(\tilde\beta-\beta))^2\mid X\right]
= (\tilde\beta-\beta)^\top\Sigma(\tilde\beta-\beta).
\end{align*}
[/definition]
For an estimator $\hat\beta$, its excess prediction risk is $R_\beta(\hat\beta)$. This risk records the geometry of the test covariance rather than the Euclidean error alone. It motivates decomposing ridgeless regression into the part of the signal outside the row span and the part of the noise amplified by small sample-covariance eigenvalues.
[quotetheorem:5957]
[citeproof:5957]
Each hypothesis has a visible role in the formula. The invertibility of $XX^\top$ makes the displayed minimum-norm interpolator and projection $P_X$ well-defined; if two rows of $X$ are identical, then $XX^\top$ is singular and the inverse in the theorem does not exist, so the Moore-Penrose inverse must replace it and the projection/noise term must be rewritten. The zero conditional noise mean removes the bias-noise cross term; if $\mathbb E[\varepsilon\mid X]=m\ne 0$, an additional term involving $m^\top(XX^\top)^{-1}X\Sigma(I_d-P_X)\beta$ appears. The conditional covariance assumption $\mathbb E[\varepsilon\varepsilon^\top\mid X]=\sigma^2I_n$ turns the noise contribution into the simple trace shown above; if instead the noise covariance is a general matrix $\Omega$, the variance term becomes $\operatorname{tr}(\Sigma X^\top(XX^\top)^{-1}\Omega(XX^\top)^{-1}X)$, so correlations or heteroskedasticity can increase risk in directions selected by the design. The test covariance $\Sigma$ matters because prediction error is measured in the geometry of future data, so small Euclidean error in directions with large test variance can still be costly.
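The roles of the three terms can be verified on a single draw. The sketch below assumes the decomposition has the form indicated by the discussion above, with bias $\beta^\top(I_d-P_X)\Sigma(I_d-P_X)\beta$ and variance trace $\sigma^2\operatorname{tr}(\Sigma X^\top(XX^\top)^{-2}X)$; the dimensions, the diagonal $\Sigma$, and all variable names are illustrative. It checks the pathwise identity behind the theorem: the realised risk equals bias plus a quadratic form in the noise plus a cross term, and the cross term is the one that vanishes in conditional expectation when $\mathbb E[\varepsilon\mid X]=0$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, sigma = 10, 30, 0.5
X = rng.standard_normal((n, d))
beta = rng.standard_normal(d)
Sigma = np.diag(np.linspace(0.5, 2.0, d))       # hypothetical test covariance
eps = sigma * rng.standard_normal(n)
Y = X @ beta + eps

G_inv = np.linalg.inv(X @ X.T)
B = X.T @ G_inv                                 # maps noise to estimator error
P = B @ X                                       # projection onto the row span of X
beta_hat = B @ Y                                # ridgeless estimator

resid = (np.eye(d) - P) @ beta                  # part of beta outside the row span
risk = (beta_hat - beta) @ Sigma @ (beta_hat - beta)
bias = resid @ Sigma @ resid
noise_quad = eps @ B.T @ Sigma @ B @ eps        # has conditional mean = variance_trace
cross = -2.0 * resid @ Sigma @ B @ eps          # has conditional mean zero if E[eps | X] = 0

variance_trace = sigma**2 * np.trace(Sigma @ X.T @ G_inv @ G_inv @ X)
print(risk, bias + noise_quad + cross, variance_trace)
```

Averaging `noise_quad` over independent noise draws would recover `variance_trace`, while `cross` would average to zero under the zero-mean assumption.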
This decomposition also shows why exact interpolation needs a separate asymptotic notion. Near the interpolation threshold, small eigenvalues of $XX^\top$ can make the variance trace large even though the training equations are solved exactly; for instance, isotropic Gaussian designs with $d$ close to $n$ have smallest eigenvalues of $XX^\top$ near zero, and the corresponding inverse eigenvalues create a variance spike. The definition below isolates the exceptional high-dimensional regimes where the bias and variance terms nevertheless vanish in the prediction metric.
[definition: Benign Overfitting]
For each $m\in\mathbb N$, let $\mathcal D_m\subseteq \mathbb R^{n_m\times d_m}\times\mathbb R^{n_m}$ be a data domain, let $\beta_m\in\mathbb R^{d_m}$ be the target parameter, let $\Sigma_m$ be the test covariance, and let
\begin{align*}
\hat\beta_m:\mathcal D_m\to\mathbb R^{d_m}
\end{align*}
be an estimator map. The sequence $(\hat\beta_m)_{m\ge1}$ exhibits benign overfitting if, with probability tending to $1$, it interpolates the training data,
\begin{align*}
X_m\hat\beta_m(X_m,Y_m)=Y_m,
\end{align*}
and its excess prediction risk satisfies
\begin{align*}
R_{\beta_m}^{(m)}(\hat\beta_m(X_m,Y_m))\xrightarrow{\mathbb P}0,
\end{align*}
where $R_{\beta_m}^{(m)}(b)=(b-\beta_m)^\top\Sigma_m(b-\beta_m)$.
[/definition]
Equivalently, in a normalisation that includes observation noise, the full prediction risk converges to the irreducible-noise benchmark. The definition is asymptotic and model-dependent. It does not say that interpolation is always harmless; it says that the spectral geometry of some high-dimensional designs makes the variance term small enough and the bias term negligible enough, motivating spectral conditions that can be checked in concrete covariance models.
[remark: Spectral Conditions for Benign Ridgeless Regression]
Sharp benign-overfitting theorems are model-specific, but the risk decomposition identifies the common spectral mechanism. The bias term is small when the component of $\beta$ outside the row span of $X$ has small $\Sigma$-weighted norm. The variance term is small when the inverse eigenvalue contribution from $XX^\top$ remains controlled after weighting by the test covariance $\Sigma$. Thus overparameterised ridgeless regression is benign only in regimes where the covariance spectrum spreads noise over many weak directions while the signal is concentrated in directions that the sampled row span captures.
[/remark]
This is not a universal theorem because exact necessary and sufficient conditions depend on the feature distribution, eigenvalue decay, signal alignment, and asymptotic relation between $n$ and $d$. For isotropic features with a poorly aligned signal near the interpolation threshold, the inverse-eigenvalue term can dominate and benign overfitting fails. The risk decomposition above is the common starting point for the sharper statements proved in specific covariance models.
[example: Isotropic Ridgeless Regression]
Let $X$ have i.i.d. rows $X_i\sim \mathcal N(0,I_d)$, take $d>n$, and set $\Sigma=I_d$. By *Ridgeless Risk Decomposition*, the conditional excess-risk decomposition is
\begin{align*}
\mathbb E\left[R_\beta(\hat\beta)\mid X\right]=((I_d-P_X)\beta)^\top (I_d-P_X)\beta+\sigma^2\operatorname{tr}\left(X^\top(XX^\top)^{-2}X\right).
\end{align*}
Since $P_X$ is the Euclidean projection onto the row span of $X$, the bias term becomes
\begin{align*}
((I_d-P_X)\beta)^\top (I_d-P_X)\beta=|(I_d-P_X)\beta|^2.
\end{align*}
For Gaussian rows with $d>n$, the matrix $X$ has row rank $n$ with probability $1$, so $P_X$ is a random rank-$n$ projection.
The variance trace reduces by cyclicity of trace:
\begin{align*}
\operatorname{tr}\left(X^\top(XX^\top)^{-2}X\right)=\operatorname{tr}\left((XX^\top)^{-2}XX^\top\right).
\end{align*}
Because $XX^\top$ is invertible with probability $1$ when $d>n$,
\begin{align*}
(XX^\top)^{-2}XX^\top=(XX^\top)^{-1}(XX^\top)^{-1}XX^\top=(XX^\top)^{-1}.
\end{align*}
Therefore
\begin{align*}
\sigma^2\operatorname{tr}\left(X^\top(XX^\top)^{-2}X\right)=\sigma^2\operatorname{tr}\left((XX^\top)^{-1}\right).
\end{align*}
Under the unnormalised convention $X_i\sim\mathcal N(0,I_d)$, the matrix $XX^\top$ has Wishart distribution with $d$ degrees of freedom and scale $I_n$. The inverse-Wishart mean identity gives
\begin{align*}
\mathbb E\left[(XX^\top)^{-1}\right]=\frac{I_n}{d-n-1}
\end{align*}
provided $d>n+1$. Taking traces gives
\begin{align*}
\mathbb E\left[\operatorname{tr}\left((XX^\top)^{-1}\right)\right]=\operatorname{tr}\left(\frac{I_n}{d-n-1}\right).
\end{align*}
Since $\operatorname{tr}(cI_n)=c\,\operatorname{tr}(I_n)$ for scalar $c$ and $\operatorname{tr}(I_n)=n$,
\begin{align*}
\operatorname{tr}\left(\frac{I_n}{d-n-1}\right)=\frac{n}{d-n-1}.
\end{align*}
Thus
\begin{align*}
\mathbb E\left[\operatorname{tr}\left((XX^\top)^{-1}\right)\right]=\frac{n}{d-n-1},
\end{align*}
and the expected variance contribution is
\begin{align*}
\sigma^2\frac{n}{d-n-1}.
\end{align*}
This term grows as $d$ decreases towards $n+1$, reflecting the spectral instability near interpolation, and it shrinks as $d/n$ becomes large under the unnormalised feature scaling used here.
[/example]
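The inverse-Wishart identity in the example is easy to check by simulation. The following sketch uses small illustrative values of $n$ and $d$ and a hypothetical helper `mean_inverse_trace`; it is a Monte Carlo check, so agreement holds only up to sampling error.

```python
import numpy as np

def mean_inverse_trace(n, d, n_trials, rng):
    """Monte Carlo estimate of E[tr((X X^T)^{-1})] for X with i.i.d. N(0,1) entries."""
    total = 0.0
    for _ in range(n_trials):
        X = rng.standard_normal((n, d))
        total += np.trace(np.linalg.inv(X @ X.T))
    return total / n_trials

rng = np.random.default_rng(0)
n, d = 5, 20                                    # requires d > n + 1
estimate = mean_inverse_trace(n, d, 4000, rng)
exact = n / (d - n - 1)                         # inverse-Wishart mean identity
print("Monte Carlo:", estimate, " formula:", exact)

# The formula itself shows the spike: the variance factor as d moves away from n.
for dim in (7, 10, 20, 100):
    print(dim, n / (dim - n - 1))
```

Multiplying either number by $\sigma^2$ gives the expected variance contribution for the chosen noise level.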
This is the random-matrix explanation of double descent. Classical variance increases as the number of parameters approaches the number of samples, the interpolation threshold creates a spectral singularity, and further overparameterisation can reduce prediction error by adding many low-variance directions; this motivates giving the risk-curve shape its own name.
[definition: Double Descent]
Double descent is the risk pattern in which test error decreases in the underparameterised regime, increases near the interpolation threshold, and decreases again in an overparameterised regime.
[/definition]
Double descent is not a separate estimator; it is a shape of the risk curve. In linear models, the shape is visible through inverse sample-covariance eigenvalues, and in nonlinear models related spectral quantities often play the same diagnostic role.
## How the Three Phenomena Fit Together
What do planted clique barriers, AMP recursions, and benign overfitting have in common? Each compares an information-theoretic benchmark with the behaviour of a concrete computational or spectral mechanism.
[explanation: Three Uses of Random Matrix Structure]
Sparse PCA uses random matrix theory to locate spectral thresholds and uses computational conjectures to explain why exhaustive search can beat efficient algorithms. AMP uses Gaussian random matrices to turn an iterative algorithm into a scalar state evolution recursion. Benign overfitting uses the eigenvalues and eigenvectors of the empirical covariance to decompose the risk of an interpolating estimator.
These examples show why minimax theory alone is not the end of high-dimensional statistics. A minimax lower bound says what no estimator can do; a computational lower bound says what no efficient estimator is believed to do; a state evolution theorem predicts what a specific algorithm does; and a spectral risk decomposition explains why an estimator that appears statistically dangerous can still generalise.
[/explanation]
The practical lesson is to ask four questions for any modern high-dimensional method. What is the information-theoretic threshold? What is the best known polynomial-time threshold? What spectral object controls the algorithm? What asymptotic recursion or risk decomposition predicts its performance? Chapters 1 through 5 supplied the information-theoretic tools for the first question, while Chapters 6 through 9 supplied the spectral tools for the third; this chapter shows how they interact in contemporary problems.
The course has now connected lower bounds, sparse recovery, random matrix theory, and planted-structure detection into one picture. The final chapter distills that picture into reusable proof templates and rate tables, so the main arguments can be applied efficiently in new problems.
# 12. Synthesis: Rate Tables and Proof Templates
This final chapter collects the proof templates and rate statements that recur throughout the course. The aim is not to introduce a new lower-bound inequality, but to show how the testing tools of Chapters 1 through 3, the sparse linear and compressed-sensing examples of Chapters 4 and 5, and the random-matrix tools of Chapters 6 through 11 fit together when proving a complete minimax theorem. The reader should keep in view the three ingredients that every rate statement needs: a statistical model, a parameter class, and a loss.
## Choosing a Lower-Bound Method
A lower bound begins with a choice of reduction. Studying only pointwise risk can be misleading, because an estimator may perform well at a single convenient parameter while failing badly elsewhere in the class. Minimax analysis forces the estimator to work uniformly, and lower bounds exploit this by finding a finite subproblem on which uniform performance would imply an impossible testing procedure. The practical question is: does the parameter class contain two quantitatively separated points, many well-separated points, a product hypercube, a planted mixture whose likelihood ratio has bounded second moment, or a sequence of experiments with the same limiting events? Each answer points to a different method.
[definition: Minimax Risk]
Let $(\mathcal Y_n,\mathcal A_n,(P_\theta)_{\theta \in \Theta})$ be a statistical experiment with observation space $\mathcal Y_n$, observation $Y$, parameter space $\Theta$, and loss $L: \Theta \times \Theta \to [0,\infty)$. The minimax risk over $\Theta$ is
\begin{align*}
R_n^*(\Theta, L) = \inf_{\hat{\theta}:\mathcal Y_n\to\Theta} \sup_{\theta \in \Theta} \mathbb E_\theta[L(\hat{\theta}(Y), \theta)],
\end{align*}
where the infimum is over all $\mathcal A_n$-measurable estimators.
[/definition]
The whole lower-bound toolkit is a collection of ways to prove that every estimator must make enough testing mistakes on a carefully chosen finite subset of $\Theta$. The finite subset must be separated in the target loss while remaining statistically hard to distinguish.
[explanation: Lower-Bound Method Selection]
Let $\Theta$ be a parameter class and suppose the goal is to lower bound $R_n^*(\Theta,L)$. The following choices are the standard routes.
If $\Theta$ contains two points $\theta_0,\theta_1$ separated in loss and $P_{\theta_0},P_{\theta_1}$ have quantitatively small total variation distance, Le Cam's two-point method is the natural first attempt. This is strongest when the hard part of the class is already visible in a binary subproblem.
If $\Theta$ contains a packing $\{\theta_1,\dots,\theta_M\}$ with $M$ large, pairwise loss separation at least $2s$, and average Kullback-Leibler divergence small compared with $\log M$, Fano's inequality gives a lower bound of order $s$. This is the template behind logarithmic factors such as $\log(ed/k)$.
If $\Theta$ contains a hypercube $\{\theta_v:v\in\{0,1\}^m\}$ whose coordinate flips are separated in loss and whose adjacent distributions are hard to distinguish, Assouad's lemma gives a lower bound proportional to the dimension of the hypercube. This is most efficient when the risk decomposes into many coordinatewise mistakes.
If the null and mixture alternative have likelihood ratio $L_n$ satisfying a bounded second-moment condition under the null, the chi-square method gives indistinguishability and hence a detection lower bound. This method is suited to planted structures where the alternative is a mixture over many hidden configurations.
If experiments under the alternative are contiguous to experiments under the null, no test has vanishing type I and type II errors simultaneously along that regime. Contiguity packages a second-moment or limiting experiment calculation into an asymptotic impossibility statement.
[/explanation]
This list is a decision guide rather than a replacement for construction. The rate is hidden in the geometry of the finite subset: the logarithm of a packing size, the dimension of a hypercube, or the second moment of a planted mixture.
[example: Selecting Methods Across Models]
For sparse Gaussian mean estimation, let $Y=\theta+\sigma Z$ with $Z\sim\mathcal N(0,I_d)$, and choose a Varshamov-Gilbert family $\omega_1,\dots,\omega_M\in\{0,1\}^d$ with $\|\omega_j\|_0=k$, pairwise Hamming distance $d_H(\omega_i,\omega_j)\ge k/2$, and $\log M\ge c k\log(ed/k)$. Put $\theta_j=a\omega_j$. For $i\ne j$,
\begin{align*}
|\theta_i-\theta_j|^2=a^2|\omega_i-\omega_j|^2=a^2d_H(\omega_i,\omega_j)\ge \frac{a^2k}{2}.
\end{align*}
The Gaussian Kullback-Leibler divergence between two shifted normal models is
\begin{align*}
D_{\mathrm{KL}}\!\left(\mathcal N(\theta_i,\sigma^2I_d)\,\middle\|\,\mathcal N(\theta_j,\sigma^2I_d)\right)=\frac{|\theta_i-\theta_j|^2}{2\sigma^2}=\frac{a^2d_H(\omega_i,\omega_j)}{2\sigma^2}\le \frac{a^2k}{\sigma^2},
\end{align*}
because two $k$-sparse binary vectors differ in at most $2k$ coordinates. Choosing $a^2=\alpha\sigma^2\log(ed/k)$ with $\alpha>0$ small makes the information scale at most a small constant multiple of $k\log(ed/k)$, while the squared loss separation is at least
\begin{align*}
\frac{a^2k}{2}=\frac{\alpha}{2}\sigma^2k\log(ed/k).
\end{align*}
Thus the logarithmic factor comes from the number of sparse supports, and Fano is the method whose information budget matches that packing geometry.
For estimating a dense mean vector over a cube, take the hypercube $\theta_v=av$ with $v\in\{-1,1\}^d$. If $v$ and $v'$ differ in exactly one coordinate, then
\begin{align*}
|\theta_v-\theta_{v'}|^2=a^2|v-v'|^2=4a^2.
\end{align*}
Each coordinate flip therefore contributes the same amount of squared loss, so a lower bound that sums coordinatewise testing errors gives a factor proportional to $d$. This is why Assouad is the natural method for cube-like dense classes.
For sparse PCA detection near a weak-signal boundary, the hidden object is a planted configuration rather than a single fixed point. If $Q$ is the mixture alternative, $P$ is the null, and $L=dQ/dP$, then the chi-square calculation compares two independent planted configurations:
\begin{align*}
\mathbb E_P[L^2]=\mathbb E_{(S,\varepsilon),(S',\varepsilon')}\!\left[\exp\{\text{signal strength}\cdot \text{overlap}((S,\varepsilon),(S',\varepsilon'))\}\right].
\end{align*}
When the random overlap is typically small enough that this second moment stays bounded, the mixture alternative is contiguous to the null. Across these models, the cleanest lower-bound proof mirrors the parameter geometry: packings for sparse supports, hypercubes for coordinatewise loss, and mixture second moments for planted structure.
[/example]
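The Fano bookkeeping in the sparse-mean part of the example can be sketched numerically. The code below uses the standard form of Fano's inequality, error probability at least $1-(\max \mathrm{KL}+\log 2)/\log M$; the helper names and the numerical values of the packing constant `c` and of `alpha` are hypothetical, since the text leaves both constants unspecified.

```python
import math

def fano_error_lower_bound(max_kl, log_m):
    """Fano: testing error >= 1 - (max pairwise KL + log 2) / log M."""
    return max(0.0, 1.0 - (max_kl + math.log(2)) / log_m)

def sparse_mean_packing(d, k, sigma, alpha, c):
    """Quantities for theta_j = a * omega_j with a^2 = alpha sigma^2 log(ed/k)."""
    log_ratio = math.log(math.e * d / k)
    a_sq = alpha * sigma**2 * log_ratio
    return {
        "log_M": c * k * log_ratio,        # packing size, log M >= c k log(ed/k)
        "max_kl": a_sq * k / sigma**2,     # KL <= a^2 k / sigma^2
        "sep_sq": a_sq * k / 2,            # squared separation >= a^2 k / 2
    }

# Illustrative constants (assumed): alpha = c / 4 keeps KL at a quarter of log M.
q = sparse_mean_packing(d=10_000, k=50, sigma=1.0, alpha=0.025, c=0.1)
bound = fano_error_lower_bound(q["max_kl"], q["log_M"])
print(q, bound)
```

Choosing `alpha` proportional to `c` is what keeps the information budget a fixed fraction of $\log M$, so the testing error stays bounded away from zero while the separation remains of order $\sigma^2k\log(ed/k)$.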
## Matching Upper and Lower Bounds
Once a lower bound is known, the next question is what kind of upper bound should match it. A statistical upper bound may use any estimator, including computationally infeasible searches. An algorithmic upper bound is attached to a concrete polynomial-time procedure. A random-design bound may depend on concentration of the design matrix and may fail for deterministic designs without restricted eigenvalue or restricted isometry assumptions.
[definition: Statistical and Algorithmic Rates]
Let $r_n(\Theta)$ be a positive sequence. A statistical rate $r_n(\Theta)$ is minimax-optimal for loss $L$ if
\begin{align*}
R_n^*(\Theta,L) \asymp r_n(\Theta),
\end{align*}
with constants depending only on fixed model parameters.
Fix a data space $\mathcal Y_n$, an estimator class $\mathcal E_n$ consisting of measurable maps $\mathcal Y_n\to\Theta$, and a computational model specifying which maps in $\mathcal E_n$ are admissible, such as polynomial-time algorithms. An algorithmic rate for an estimator $\hat{\theta}_n\in\mathcal E_n$ is a bound of the form
\begin{align*}
\sup_{\theta \in \Theta}\mathbb E_\theta[L(\hat{\theta}_n,\theta)] \lesssim r_n(\Theta),
\end{align*}
where $\hat{\theta}_n$ belongs to the admissible computational class.
[/definition]
After closing the minimax gap, it remains important to record whether the proof used an oracle, an exhaustive search, a convex program, a spectral estimator, or a random-matrix event. This distinction explains why some entries in the rate table are information-theoretic and others are achieved by practical algorithms.
The matching principle is a proof template rather than a new theorem. A lower bound of the form $R_n^*(\Theta,L)\ge c r_n(\Theta)$ and an upper bound from some estimator with worst-case risk at most $C r_n(\Theta)$ together identify the minimax rate up to constants. The constants $c$ and $C$ matter only through their independence of $n$ and of the varying dimensions or sparsity parameters. If the lower bound is proved on a smaller subclass of $\Theta$, it is still valid for $\Theta$, but the upper bound must hold on the whole class being claimed. If the estimator is an exhaustive search, the argument proves a statistical rate but does not by itself give a practical algorithmic rate. This distinction is the reason rate tables should record both the estimator and the proof method.
The principle is modest, but it prevents a common mistake: a minimax result is not proved by a lower bound alone, and an algorithmic guarantee is not the same as a statement about all estimators. In high dimensions the design assumption is often the point at which the two stories separate.
Two counterexamples explain the hypotheses. A Fano lower bound for sparse regression on a well-conditioned Gaussian design does not prove the same parameter-risk rate for a deterministic design with two identical columns: the vectors $\beta=a e_1$ and $\beta=a e_2$ give the same mean response, so no estimator can distinguish their coordinates from the data. Conversely, an exhaustive search over all $k$-subsets may achieve the information-theoretic sparse regression rate under restricted eigenvalues, but this construction does not imply that the Lasso, thresholded gradient descent, or any other stated polynomial-time algorithm achieves that rate.
[example: Random Design Versus Fixed Design in Sparse Regression]
Consider $Y=X\beta+w$ with $w\sim\mathcal N(0,\sigma^2I_n)$ and $\|\beta\|_0\le k$. In the random Gaussian-design formulation, the useful event is that the empirical Gram matrix preserves sparse Euclidean norms: there are constants $0<c_-<c_+<\infty$ such that, for every $u$ with $\|u\|_0\le 2k$,
\begin{align*}
c_-|u|^2
\le \frac{|Xu|^2}{n}
\le c_+|u|^2 .
\end{align*}
For Gaussian designs with the usual column normalisation, this event holds with high probability once $n\gtrsim k\log(ed/k)$. On this event, any prediction bound converts into a parameter bound whenever the error vector $\hat\beta-\beta$ is $2k$-sparse, as it is when $\hat\beta$ and $\beta$ are both $k$-sparse, by rearranging the left inequality:
\begin{align*}
c_-|\hat\beta-\beta|^2
\le \frac{|X(\hat\beta-\beta)|^2}{n},
\qquad
|\hat\beta-\beta|^2
\le \frac{1}{c_-}\frac{|X(\hat\beta-\beta)|^2}{n}.
\end{align*}
Thus the usual sparse-regression oracle bound
\begin{align*}
\frac{1}{n}|X(\hat\beta-\beta)|^2
\lesssim \frac{\sigma^2 k\log(ed/k)}{n}
\end{align*}
gives
\begin{align*}
|\hat\beta-\beta|^2
\lesssim \frac{\sigma^2 k\log(ed/k)}{n},
\end{align*}
with the restricted-eigenvalue constant absorbed into the implicit constant.
For a fixed deterministic design, the same implication is not automatic. If the first two columns of $X$ are identical and $\beta^{(1)}=a e_1$, $\beta^{(2)}=a e_2$, then
\begin{align*}
X\beta^{(1)}=aXe_1=aXe_2=X\beta^{(2)},
\end{align*}
so the two models have the same distribution:
\begin{align*}
\mathcal N(X\beta^{(1)},\sigma^2I_n)=\mathcal N(X\beta^{(2)},\sigma^2I_n).
\end{align*}
However,
\begin{align*}
|\beta^{(1)}-\beta^{(2)}|^2
=|a e_1-a e_2|^2
=a^2|e_1-e_2|^2
=2a^2.
\end{align*}
For any estimator $\hat\beta$, the triangle inequality gives
\begin{align*}
|\beta^{(1)}-\beta^{(2)}|
\le |\hat\beta-\beta^{(1)}|+|\hat\beta-\beta^{(2)}|,
\end{align*}
and therefore at least one of $|\hat\beta-\beta^{(1)}|$ and $|\hat\beta-\beta^{(2)}|$ is at least $|\beta^{(1)}-\beta^{(2)}|/2=a/\sqrt2$. Since the two data distributions are identical, no estimator can learn which coordinate carried the signal. This is why random-design sparse regression needs a high-probability restricted-eigenvalue event, while fixed-design sparse regression must assume such a condition explicitly.
[/example]
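The fixed-design obstruction can be reproduced directly. In the sketch below the design, its size, and the value of $a$ are illustrative; the only structural assumption is that the first two columns of $X$ coincide.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, a = 30, 10, 2.0
X = rng.standard_normal((n, d))
X[:, 1] = X[:, 0]                               # two identical columns

beta1 = np.zeros(d); beta1[0] = a               # a * e_1
beta2 = np.zeros(d); beta2[1] = a               # a * e_2

same_mean = np.allclose(X @ beta1, X @ beta2)   # identical mean responses
separation = np.linalg.norm(beta1 - beta2)      # sqrt(2) * a

# The restricted-eigenvalue lower bound fails along the 2-sparse direction e_1 - e_2.
u = (beta1 - beta2) / separation
restricted_ratio = np.linalg.norm(X @ u) ** 2 / n

def worst_error(b_hat):
    """Larger of the two parameter errors; at least separation / 2 for any b_hat."""
    return max(np.linalg.norm(b_hat - beta1), np.linalg.norm(b_hat - beta2))

midpoint = 0.5 * (beta1 + beta2)                # attains the bound a / sqrt(2)
print(same_mean, separation, restricted_ratio, worst_error(midpoint))
```

The midpoint shows that the triangle-inequality bound $a/\sqrt2$ is attained, so the obstruction is exactly as large as the argument predicts.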
## Canonical Minimax Rate Table
The course has returned several times to the same four quantities: sample size $n$, ambient dimension $d$, sparsity $k$, and signal-to-noise scale. This section gathers the canonical rates and records which proof template explains each line. Constants and lower-order logarithms vary across exact normalisations; the table is meant to identify the leading high-dimensional dependence rather than to introduce a new theorem.
| Setting | Canonical scale | Main mechanism |
| --- | --- | --- |
| $k$-sparse Gaussian sequence estimation | $\sigma^2 k\log(ed/k)$ | sparse support entropy plus Fano packing |
| sparse linear regression, well-conditioned design | $\sigma^2 k\log(ed/k)/n$ | sparse packing plus restricted eigenvalue transfer |
| covariance estimation, Frobenius loss | $(d^2/n)\wedge d$ | entrywise variance accumulation with diameter truncation |
| covariance estimation, operator loss | $(\sqrt{d/n}+d/n)^2\wedge 1$ | spectral concentration and edge fluctuation |
| rank-one PCA subspace loss | $(d/(n\lambda^2))\wedge 1$ | covariance fluctuation plus Davis-Kahan perturbation |
| uniform compressed sensing recovery | $n\gtrsim k\log(ed/k)$ | sparse entropy and restricted isometry |
Each line abbreviates a precise theorem proved earlier or quoted in the relevant chapter, so the hypotheses are part of the comparison. The sparse-regression line has two readings: in the random-design formulation it is an unconditional statement for the joint experiment $(X,Y)$, while in the fixed-design formulation it is a conditional statement after restricting to matrices whose restricted eigenvalues are controlled. Mixing these two formulations without saying which expectation is being taken would hide the step that converts prediction control into parameter control. Without column normalisation or restricted eigenvalues, two distinct sparse vectors can give nearly the same mean response, so parameter error is not controlled by prediction error.
In covariance estimation, bounded spectra fix the operator scale but do not make the Frobenius diameter constant. The Frobenius entry therefore says that entrywise estimation accumulates over about $d^2$ directions only until it reaches the order-$d$ squared diameter of the class. The operator-norm entry has a different truncation because the bounded-spectrum diameter in operator norm is constant. Without bounded spectra, the scale of the covariance matrix itself changes both losses.
Without truncation in PCA, the expression
\begin{align*}
\frac{d}{n\lambda^2}
\end{align*}
can exceed the maximal possible squared sine loss, so the displayed rate would no longer describe a risk. In compressed sensing, Gaussian measurements supply restricted isometry uniformly over sparse vectors; an adversarial measurement matrix with a sparse vector in its nullspace gives identical noiseless data for two different sparse signals. In sparse PCA, the information-theoretic sparse-sphere packing rate does not say that a plain leading-eigenvector estimator finds a sparse direction, because dense noise coordinates can dominate unless the algorithm enforces sparsity or the signal is stronger.
These qualifications point forward to the proof templates. Sparse means and sparse regression use packing arguments, but sparse regression needs an additional design step before prediction separation becomes parameter separation. Covariance Frobenius risk is a dimension-counting problem with a class-diameter cap, whereas covariance operator risk is a spectral-fluctuation problem. PCA then uses the spectral fluctuation as input to a deterministic perturbation inequality, and compressed sensing uses the same sparse entropy through uniform embedding rather than through estimation risk.
The table is a compressed map of the course. It should be read together with the proof templates below: rates are not memorised constants, but outputs of a small number of recurring mechanisms.
[example: Full Rate Table]
In Gaussian sequence estimation $Y=\theta+\sigma Z$ with $Z\sim\mathcal N(0,I_d)$, the dense Euclidean ball $\{\theta:|\theta|\le R\}$ admits two simple benchmark estimators. The zero estimator has risk
\begin{align*}
\mathbb E_\theta[|0-\theta|^2]=|\theta|^2\le R^2,
\end{align*}
while the unbiased estimator $\hat\theta=Y$ has risk
\begin{align*}
\mathbb E_\theta[|Y-\theta|^2]
=\mathbb E[|\sigma Z|^2]
=\sigma^2\sum_{\ell=1}^d \mathbb E[Z_\ell^2]
=\sigma^2 d.
\end{align*}
Thus the natural dense scale is $\min\{R^2,\sigma^2 d\}$. For the $k$-sparse class, the packing calculation uses about $\exp\{c k\log(ed/k)\}$ supports, so the sparse sequence scale is
\begin{align*}
\sigma^2\cdot k\log(ed/k),
\end{align*}
with $k$ replacing $d$ only after paying the support-selection logarithm.
In sparse regression with Gaussian design, a prediction bound of the form
\begin{align*}
\frac{1}{n}|X(\hat\beta-\beta)|^2
\lesssim \frac{\sigma^2 k\log(ed/k)}{n}
\end{align*}
becomes a parameter bound when the restricted eigenvalue inequality holds on $2k$-sparse vectors:
\begin{align*}
c_-|\hat\beta-\beta|^2
\le \frac{1}{n}|X(\hat\beta-\beta)|^2.
\end{align*}
Since $\hat\beta-\beta$ is supported on at most $2k$ coordinates in the ideal sparse comparison, division by $c_-$ gives
\begin{align*}
|\hat\beta-\beta|^2
\le \frac{1}{c_-}\frac{1}{n}|X(\hat\beta-\beta)|^2
\lesssim \frac{\sigma^2 k\log(ed/k)}{n}.
\end{align*}
The same sparse entropy $k\log(ed/k)$ appears, but the factor $1/n$ comes from averaging $n$ noisy linear measurements.
For covariance estimation, Frobenius loss sums squared entrywise errors:
\begin{align*}
\|\hat\Sigma-\Sigma\|_F^2
=\sum_{r=1}^d\sum_{s=1}^d(\hat\Sigma_{rs}-\Sigma_{rs})^2.
\end{align*}
There are order $d^2$ matrix entries, and each empirical covariance entry has variance of order $1/n$ under bounded-spectrum Gaussian sampling, so the accumulated scale is
\begin{align*}
d^2\cdot \frac{1}{n}=\frac{d^2}{n},
\end{align*}
until the bounded-spectrum diameter truncates the risk at order $d$. Operator loss instead tracks the largest singular fluctuation:
\begin{align*}
\|\hat\Sigma-\Sigma\|_{\mathrm{op}}
\lesssim \sqrt{\frac dn}+\frac dn,
\end{align*}
so in the regime $d/n=O(1)$, where the $\sqrt{d/n}$ term dominates up to constants,
\begin{align*}
\|\hat\Sigma-\Sigma\|_{\mathrm{op}}^2
\lesssim \left(\sqrt{\frac dn}+\frac dn\right)^2
=\frac dn+2\left(\frac dn\right)^{3/2}+\left(\frac dn\right)^2
\asymp \frac dn.
\end{align*}
In rank-one PCA with $\Sigma=I_d+\lambda vv^\top$, the population eigengap is
\begin{align*}
(1+\lambda)-1=\lambda.
\end{align*}
Combining the same covariance fluctuation scale with the Davis-Kahan perturbation step gives
\begin{align*}
\sin\angle(\hat v,v)
\lesssim \frac{\|\hat\Sigma-\Sigma\|_{\mathrm{op}}}{\lambda}
\lesssim \frac{\sqrt{d/n}+d/n}{\lambda}.
\end{align*}
When $d/n=O(1)$ and the spike is separated so that $\sqrt{d/n}$ is the leading fluctuation term, squaring yields
\begin{align*}
\sin^2\angle(\hat v,v)
\lesssim \frac{d/n}{\lambda^2}
=\frac{d}{n\lambda^2}.
\end{align*}
Thus the table is read by matching each loss to the quantity it accumulates: coordinates for dense means, sparse supports for sparse problems, matrix entries for Frobenius covariance loss, spectral fluctuation for operator loss, and spectral fluctuation divided by eigengap for PCA.
[/example]
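The canonical scales in the table can be evaluated for concrete values of $n$, $d$, $k$, $\sigma$ and $\lambda$. The following is a minimal sketch in plain Python; the function names are hypothetical helpers introduced here for illustration, and every function returns only the leading-order scale with all constants set to one, so the outputs indicate orders of magnitude rather than exact risks.

```python
import math


def sparse_entropy(k: int, d: int) -> float:
    """Sparse support entropy k * log(e * d / k)."""
    return k * math.log(math.e * d / k)


def sparse_sequence_scale(sigma: float, k: int, d: int) -> float:
    """Scale sigma^2 * k * log(e d / k) for k-sparse Gaussian sequences."""
    return sigma**2 * sparse_entropy(k, d)


def sparse_regression_scale(sigma: float, k: int, d: int, n: int) -> float:
    """Same entropy, divided by n for well-conditioned designs."""
    return sigma**2 * sparse_entropy(k, d) / n


def covariance_frobenius_scale(d: int, n: int) -> float:
    """Entrywise accumulation d^2 / n, truncated at the order-d diameter."""
    return min(d**2 / n, d)


def covariance_operator_scale(d: int, n: int) -> float:
    """Squared spectral fluctuation, truncated at a constant."""
    return min((math.sqrt(d / n) + d / n) ** 2, 1.0)


def pca_scale(d: int, n: int, lam: float) -> float:
    """Rank-one PCA scale d / (n * lambda^2), truncated at one."""
    return min(d / (n * lam**2), 1.0)


# Illustrative values only (assumed, not taken from the course).
d, n, k, sigma, lam = 1000, 200, 10, 1.0, 2.0
print("sparse sequence  :", sparse_sequence_scale(sigma, k, d))
print("sparse regression:", sparse_regression_scale(sigma, k, d, n))
print("cov. Frobenius   :", covariance_frobenius_scale(d, n))
print("cov. operator    :", covariance_operator_scale(d, n))
print("rank-one PCA     :", pca_scale(d, n, lam))
```

With $d>n$ as in the illustrative values, both covariance scales hit their truncation, which is the regime where the diameter caps discussed above become active.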
## Packing-Based Lower-Bound Template
Most minimax lower bounds in the course follow the same chain: build a finite parameter set, prove loss separation, bound information, and reduce estimation to testing. The problem is to choose the packing scale so that separation is large while information remains small.
[explanation: Packing Lower-Bound Template]
Start with a target scale $s>0$ and construct parameters $\theta_1,\dots,\theta_M\in\Theta$ such that $L(\theta_i,\theta_j)\ge 2s$ for $i\ne j$. Let $V$ be uniform on $\{1,\dots,M\}$ and let the data be drawn from $P_{\theta_V}$. Any estimator $\hat\theta$ induces a decoder $\hat V$ by choosing an index whose parameter is closest to $\hat\theta$ in loss.
The separation condition implies that small estimation loss forces correct decoding. Fano's inequality then gives
\begin{align*}
\inf_{\hat\theta}\sup_{\theta\in\Theta}\mathbb E_\theta[L(\hat\theta,\theta)]
\ge s\left(1-\frac{I(V;Y)+\log 2}{\log M}\right),
\end{align*}
up to harmless changes in constants depending on the exact loss normalisation. The information term is usually bounded by
\begin{align*}
I(V;Y) \le \frac{1}{M^2}\sum_{i,j=1}^M D_{\mathrm{KL}}(P_{\theta_i}\|P_{\theta_j})
\end{align*}
or by comparison with a centre point $P_{\theta_0}$. Optimising the amplitude of the packing gives the final rate.
[/explanation]
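The trade-off between separation and information can be seen numerically by evaluating the Fano bound for given values of $s$, $I(V;Y)$ and $M$. The sketch below is a direct transcription of the displayed inequality together with the average-KL bound on the information term; the function names are hypothetical, and the bound is clipped at zero because a negative lower bound on a risk carries no content.

```python
import math


def fano_lower_bound(s: float, info: float, M: int) -> float:
    """Evaluate s * (1 - (I + log 2) / log M), clipped at zero.

    s    : half the pairwise loss separation of the packing
    info : an upper bound on the mutual information I(V; Y), in nats
    M    : number of packing points (at least 2)
    """
    if M < 2:
        raise ValueError("Fano's inequality needs at least two hypotheses")
    return max(0.0, s * (1.0 - (info + math.log(2)) / math.log(M)))


def average_kl(kl_matrix) -> float:
    """Average of an M x M matrix of pairwise KL divergences.

    This average upper-bounds I(V; Y) when V is uniform on the packing.
    """
    M = len(kl_matrix)
    return sum(sum(row) for row in kl_matrix) / M**2


# Toy numbers (assumed for illustration): 16 hypotheses, pairwise KL = log 2.
M = 16
kl = [[0.0 if i == j else math.log(2) for j in range(M)] for i in range(M)]
print(fano_lower_bound(s=2.0, info=average_kl(kl), M=M))
```

Increasing the packing amplitude raises both `s` and `info`, so the bound is maximised at an intermediate amplitude, which is the optimisation step described above.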
This template explains why logarithmic factors are geometric. In sparse problems, there are roughly $\exp\{c k\log(ed/k)\}$ distinguishable supports, and the information budget determines how much signal can be placed on each chosen coordinate.
[example: Sparse Gaussian Means via Fano]
Let $Y=\theta+\sigma Z$ with $Z\sim\mathcal N(0,I_d)$, and choose a Varshamov-Gilbert packing $\omega_1,\dots,\omega_M\in\{0,1\}^d$ with $\|\omega_j\|_0=k$ such that $d_H(\omega_i,\omega_j)\ge k/2$ for $i\ne j$ and $\log M\ge c_0 k\log(ed/k)$ for a numerical constant $c_0>0$. Set $\theta_j=a\omega_j$. Since $\omega_i-\omega_j$ has entries in $\{-1,0,1\}$, its squared Euclidean norm equals its Hamming distance, so for $i\ne j$,
\begin{align*}
|\theta_i-\theta_j|^2=a^2|\omega_i-\omega_j|^2=a^2d_H(\omega_i,\omega_j)\ge \frac{a^2k}{2}.
\end{align*}
For shifted Gaussian sequence models with common covariance $\sigma^2I_d$,
\begin{align*}
D_{\mathrm{KL}}\!\left(\mathcal N(\theta_i,\sigma^2I_d)\,\middle\|\,\mathcal N(\theta_j,\sigma^2I_d)\right)=\frac{|\theta_i-\theta_j|^2}{2\sigma^2}.
\end{align*}
Each $\omega_i$ and $\omega_j$ has exactly $k$ nonzero coordinates, so their supports differ in at most $2k$ coordinates. Hence
\begin{align*}
|\theta_i-\theta_j|^2=a^2d_H(\omega_i,\omega_j)\le 2a^2k,
\end{align*}
and therefore
\begin{align*}
D_{\mathrm{KL}}\!\left(\mathcal N(\theta_i,\sigma^2I_d)\,\middle\|\,\mathcal N(\theta_j,\sigma^2I_d)\right)\le \frac{a^2k}{\sigma^2}.
\end{align*}
Choose
\begin{align*}
a^2=\alpha\sigma^2\log\left(\frac{ed}{k}\right)
\end{align*}
with $\alpha>0$ small enough that $\alpha k\log(ed/k)\le \log M/8$. Then every pairwise Kullback-Leibler divergence, and hence their average, is at most $\log M/8$. By *Fano's inequality*, the probability of misidentifying the index $j$ in the finite experiment $\{\theta_1,\dots,\theta_M\}$ is at least $1-\tfrac18-\log 2/\log M$, which is at least $5/8$ once $M\ge 16$.
Given an estimator $\hat\theta$, decode by choosing a nearest packing point to $\hat\theta$. If the decoder makes an error under $\theta_j$, then the triangle inequality forces
\begin{align*}
|\hat\theta-\theta_j|\ge \frac12\min_{i\ne j}|\theta_i-\theta_j|.
\end{align*}
Using the separation already proved,
\begin{align*}
|\hat\theta-\theta_j|^2\ge \frac14\cdot\frac{a^2k}{2}=\frac{a^2k}{8}
\end{align*}
on the decoding-error event. Thus
\begin{align*}
\inf_{\hat\theta}\sup_{\|\theta\|_0\le k}\mathbb E_\theta[|\hat\theta-\theta|^2]\ge c\,a^2k=c\,\alpha\sigma^2 k\log\left(\frac{ed}{k}\right)
\end{align*}
for a numerical constant $c>0$. The logarithmic factor is the price of not knowing the support: the packing contains exponentially many sparse supports, while the amplitude is chosen so that those alternatives remain statistically hard to distinguish.
[/example]
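The separation and divergence bounds in this example can be checked on an explicit finite packing. The sketch below does not implement the Varshamov-Gilbert lemma: it swaps in a simple greedy procedure that draws random $k$-subsets and keeps those at Hamming distance at least $k/2$ from every subset kept so far. The values of $d$, $k$, $\sigma$, $\alpha$ and the number of trials are assumptions chosen for illustration.

```python
import math
import random
from itertools import combinations


def hamming(s: frozenset, t: frozenset) -> int:
    """Hamming distance between the indicator vectors of two supports."""
    return len(s ^ t)  # size of the symmetric difference


def greedy_sparse_packing(d: int, k: int, trials: int, seed: int = 0):
    """Greedy random packing of k-subsets of {0, ..., d-1}.

    A candidate is kept only if its Hamming distance to every kept
    support is at least k / 2 (a stand-in for Varshamov-Gilbert).
    """
    rng = random.Random(seed)
    packing = []
    for _ in range(trials):
        cand = frozenset(rng.sample(range(d), k))
        if all(hamming(cand, s) >= k / 2 for s in packing):
            packing.append(cand)
    return packing


def gaussian_kl(sq_dist: float, sigma: float) -> float:
    """KL divergence between N(theta_i, sigma^2 I) and N(theta_j, sigma^2 I)."""
    return sq_dist / (2 * sigma**2)


d, k, sigma, alpha = 200, 8, 1.0, 0.05
packing = greedy_sparse_packing(d, k, trials=500)
a2 = alpha * sigma**2 * math.log(math.e * d / k)  # squared amplitude a^2

# theta_j = a * omega_j, so |theta_i - theta_j|^2 = a^2 * d_H(omega_i, omega_j).
sq_dists = [a2 * hamming(s, t) for s, t in combinations(packing, 2)]
kls = [gaussian_kl(x, sigma) for x in sq_dists]

print("packing size      :", len(packing))
print("min sq. separation:", min(sq_dists), ">=", a2 * k / 2)
print("max KL            :", max(kls), "<=", a2 * k / sigma**2)
```

The lower bound on separation holds by construction of the packing, while the upper bound on the divergence holds for any pair of $k$-sparse supports, matching the two displays in the example.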
## Spectral-Limit and Perturbation Template
Random-matrix arguments have a different shape. The problem is not to count packings, but to separate deterministic structure from stochastic spectral fluctuation. The estimator is usually an eigenvalue, eigenspace, or resolvent functional, and the proof has to control both the limiting spectrum and the finite-sample deviation.
[explanation: Spectral-Limit Template]
For sample covariance matrices, write the object of interest as a deterministic signal plus a random perturbation. First identify the limiting spectral distribution or edge location for the noise-only matrix, often through the Stieltjes transform. Then prove a high-probability finite-sample bound placing the empirical spectrum near that limit. Finally use a perturbation inequality, such as Weyl's inequality or Davis-Kahan, to convert spectral separation into eigenvalue or eigenspace accuracy.
In spiked models, the same template has an additional threshold step. A spike below the spectral edge is absorbed into the bulk and is statistically hard to detect by spectral methods. A spike above the edge creates an outlier eigenvalue and a nonzero limiting alignment between the empirical and population eigenvectors. The gap between the outlier and the bulk controls the stability of the associated eigenspace.
[/explanation]
This template also shows why random matrix limits and non-asymptotic concentration appear together in the course. If the perturbation is comparable to the eigengap, the leading empirical eigenvector may rotate substantially even when the sample covariance is a reasonable operator-norm approximation. The next theorem isolates the deterministic perturbation step needed after a spectral event has been proved: once the sample covariance is close to the population covariance in operator norm by a fixed fraction of the population eigengap, eigenspace error is controlled by that eigengap.
[quotetheorem:5960]
[citeproof:5960]
The perturbation bound is deterministic: it holds on every realisation in which the spectral event occurs. The eigengap hypothesis is essential: if the leading eigenvalue is not separated, the leading eigenspace is not stable under small perturbations. A concrete failure occurs when $\Sigma=\operatorname{diag}(2,2,1,\dots,1)$: every unit vector in the span of $e_1$ and $e_2$ is a leading eigenvector, so there is no unique top direction to estimate. The theorem does not prove that the event $\|\hat\Sigma-\Sigma\|_{\mathrm{op}}\le\lambda/4$ occurs; that requires a separate concentration or random-matrix input. Once that separate input gives $\|\hat\Sigma-\Sigma\|_{\mathrm{op}}\lesssim \sqrt{d/n}+d/n$ with high probability, the familiar PCA rate follows after taking expectations or integrating the tail bound. This separation of deterministic perturbation and probabilistic control is the pattern behind the rank-one PCA row of the rate table.
The statistical content lies in proving that the event occurs with high probability and in understanding when the eigengap is larger than the random spectral fluctuation.
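The deterministic step can be sanity-checked in two dimensions, where eigenvectors of a symmetric matrix have a closed form. The sketch below takes $\Sigma=\operatorname{diag}(1+\lambda,1)$, adds a symmetric perturbation $E$, and compares $|\sin\angle(\hat v,e_1)|$ with $2\|E\|_{\mathrm{op}}/\lambda$ on the event $\|E\|_{\mathrm{op}}\le\lambda/4$. It is a toy numerical check under these assumptions, not a proof, and the helper names are hypothetical.

```python
import math


def op_norm_sym2(e11: float, e12: float, e22: float) -> float:
    """Operator norm of the symmetric matrix [[e11, e12], [e12, e22]].

    Its eigenvalues are mean +/- radius, so the norm is |mean| + radius.
    """
    mean = (e11 + e22) / 2
    radius = math.hypot((e11 - e22) / 2, e12)
    return abs(mean) + radius


def top_eigvec_angle(a: float, b: float, c: float) -> float:
    """Angle between e_1 and the leading eigenvector of [[a, b], [b, c]]."""
    return 0.5 * math.atan2(2 * b, a - c)


def davis_kahan_check(lam: float, e11: float, e12: float, e22: float):
    """Return (|sin angle|, 2 * ||E||_op / lam, whether ||E||_op <= lam / 4)."""
    err = op_norm_sym2(e11, e12, e22)
    theta = top_eigvec_angle(1 + lam + e11, e12, 1 + e22)
    return abs(math.sin(theta)), 2 * err / lam, err <= lam / 4


sine, bound, in_event = davis_kahan_check(lam=1.0, e11=0.0, e12=0.1, e22=0.0)
print(sine, bound, in_event)
```

Shrinking `lam` while holding the perturbation fixed eventually leaves the event $\|E\|_{\mathrm{op}}\le\lambda/4$, which is the numerical counterpart of the instability described above.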
[example: Spectral Rate for Rank-One PCA]
In the rank-one spiked covariance model, write
\begin{align*}
\Sigma=I_d+\lambda vv^\top,\qquad |v|=1,\qquad \lambda>0,
\end{align*}
and let $\hat v$ be a leading eigenvector of the sample covariance $\hat\Sigma$. The population top direction is $v$, because
\begin{align*}
\Sigma v=(I_d+\lambda vv^\top)v=v+\lambda v(v^\top v)=(1+\lambda)v.
\end{align*}
If $u\perp v$, then $v^\top u=0$, so
\begin{align*}
\Sigma u=(I_d+\lambda vv^\top)u=u+\lambda v(v^\top u)=u.
\end{align*}
Thus the top population eigenvalue is $1+\lambda$, all orthogonal directions have eigenvalue $1$, and the population eigengap is
\begin{align*}
(1+\lambda)-1=\lambda.
\end{align*}
Suppose the sample covariance satisfies the spectral concentration event
\begin{align*}
\|\hat\Sigma-\Sigma\|_{\mathrm{op}}\le C\left(\sqrt{\frac dn}+\frac dn\right),
\end{align*}
and suppose this upper bound is at most $\lambda/4$. By *Davis-Kahan Rank-One PCA Bound*,
\begin{align*}
\sin\angle(\hat v,v)\le \frac{2\|\hat\Sigma-\Sigma\|_{\mathrm{op}}}{\lambda}.
\end{align*}
Substituting the concentration bound gives
\begin{align*}
\sin\angle(\hat v,v)\le \frac{2C}{\lambda}\left(\sqrt{\frac dn}+\frac dn\right).
\end{align*}
Squaring both sides gives
\begin{align*}
\sin^2\angle(\hat v,v)\le \frac{4C^2}{\lambda^2}\left(\sqrt{\frac dn}+\frac dn\right)^2.
\end{align*}
Expanding the square,
\begin{align*}
\left(\sqrt{\frac dn}+\frac dn\right)^2=\frac dn+2\left(\frac dn\right)^{3/2}+\left(\frac dn\right)^2.
\end{align*}
If $d/n\le \gamma$ for a fixed constant $\gamma$, then
\begin{align*}
2\left(\frac dn\right)^{3/2}=2\sqrt{\frac dn}\frac dn\le 2\sqrt{\gamma}\frac dn.
\end{align*}
Also,
\begin{align*}
\left(\frac dn\right)^2\le \gamma\frac dn.
\end{align*}
Therefore
\begin{align*}
\sin^2\angle(\hat v,v)\le \frac{4C^2(1+2\sqrt{\gamma}+\gamma)}{\lambda^2}\frac dn.
\end{align*}
Hence, in the separated regime with bounded $d/n$, the leading squared sine error scale is
\begin{align*}
\frac{d}{n\lambda^2}.
\end{align*}
The rate comes from dividing the random spectral fluctuation by the deterministic eigengap and then squaring; when the spike approaches the spectral edge, the effective gap shrinks and the same perturbation step yields a larger eigenspace error.
[/example]
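The algebra at the end of the example can be checked by comparing the exact squared bound with its simplified form. In the sketch below, `C` stands for the unspecified concentration constant and `gamma` for the assumed cap on $d/n$; both function names are hypothetical, and the numbers are illustrative only.

```python
import math


def exact_sq_bound(d: int, n: int, lam: float, C: float = 1.0) -> float:
    """(4 C^2 / lam^2) * (sqrt(d/n) + d/n)^2."""
    fluct = math.sqrt(d / n) + d / n
    return 4 * C**2 * fluct**2 / lam**2


def simplified_sq_bound(d: int, n: int, lam: float, gamma: float,
                        C: float = 1.0) -> float:
    """(4 C^2 (1 + 2 sqrt(gamma) + gamma) / lam^2) * d/n, valid if d/n <= gamma."""
    if d / n > gamma:
        raise ValueError("the simplification assumes d / n <= gamma")
    return 4 * C**2 * (1 + 2 * math.sqrt(gamma) + gamma) * (d / n) / lam**2


for d, n in [(10, 1000), (100, 400), (500, 500)]:
    exact = exact_sq_bound(d, n, lam=2.0)
    simple = simplified_sq_bound(d, n, lam=2.0, gamma=1.0)
    print(d, n, exact, simple)
```

The two expressions agree when $d/n=\gamma$ and the simplified form is looser otherwise, so the constant $1+2\sqrt\gamma+\gamma$ is the price of writing the bound as a multiple of $d/(n\lambda^2)$.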
## Detection, Chi-Square Bounds, and Contiguity
Some questions are sharper as detection problems than as estimation problems. The guiding question is whether the likelihood ratio of a structured alternative has a bounded second moment under the null. If it does, the two experiments cannot be reliably separated, and estimation of the hidden structure cannot be reliable in that regime.
[definition: Contiguity]
Let $(P_n)$ and $(Q_n)$ be sequences of probability measures on measurable spaces $(\Omega_n,\mathcal F_n)$. The sequence $(Q_n)$ is contiguous with respect to $(P_n)$ if for every sequence of events $A_n\in\mathcal F_n$,
\begin{align*}
P_n(A_n)\to 0 \implies Q_n(A_n)\to 0.
\end{align*}
Mutual contiguity means each sequence is contiguous with respect to the other.
[/definition]
Contiguity is the asymptotic language behind bounded second moments. For detection lower bounds, the obstruction is that a successful test would need rejection events whose null probabilities vanish while their alternative probabilities stay bounded away from zero.
The next question is therefore not just what contiguity means, but how to prove it from the likelihood-ratio estimates that arise in practice. In second-moment arguments one usually computes $L_n=dQ_n/dP_n$ under the null and tries to show that it cannot concentrate too much mass on rare null events.
The criterion below gives exactly that bridge. It says that a uniform $L^2(P_n)$ bound on the likelihood ratios prevents rare events under $P_n$ from carrying substantial probability under $Q_n$, turning a bounded second moment into one-sided contiguity.
[quotetheorem:5944]
[citeproof:5944]
The absolute-continuity hypothesis is needed so that the likelihood ratio $L_n=dQ_n/dP_n$ exists; without it, the displayed Cauchy-Schwarz argument has no object to apply to. The theorem gives one-sided contiguity only, so mutual contiguity requires applying an analogous argument in the other direction or using a stronger criterion. A bounded second moment rules out tests whose rejection events have vanishing null probability and nonvanishing alternative probability, which is exactly the obstruction used in detection lower bounds.
The calculation required by the theorem is often the most delicate part of a detection lower bound. The mixture must be broad enough to hide the signal but structured enough that the second moment can be evaluated.
[example: Simple Gaussian Spike Second Moment]
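Let $\mathcal V\subset\mathbb R^d$ be a finite set of unit vectors with $m=|\mathcal V|$ elements, let $\mu>0$ be the signal strength, let $P=\mathcal N(0,I_d)$, and let $Q$ be the mixture alternative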
\begin{align*}
Q=\frac1m\sum_{v\in\mathcal V}\mathcal N(\mu v,I_d).
\end{align*}
For fixed $v\in\mathcal V$, the density ratio of $\mathcal N(\mu v,I_d)$ with respect to $\mathcal N(0,I_d)$ is
\begin{align*}
\frac{d\mathcal N(\mu v,I_d)}{d\mathcal N(0,I_d)}(y)=\exp\left\{-\frac12|y-\mu v|^2+\frac12|y|^2\right\}.
\end{align*}
Since $|v|=1$,
\begin{align*}
|y-\mu v|^2=|y|^2-2\mu v\cdot y+\mu^2.
\end{align*}
Substituting this expansion gives
\begin{align*}
-\frac12|y-\mu v|^2+\frac12|y|^2=\mu v\cdot y-\frac{\mu^2}{2}.
\end{align*}
Hence
\begin{align*}
\frac{d\mathcal N(\mu v,I_d)}{d\mathcal N(0,I_d)}(y)=\exp\left\{\mu v\cdot y-\frac{\mu^2}{2}\right\},
\end{align*}
and the mixture likelihood ratio is
\begin{align*}
L(y)=\frac{dQ}{dP}(y)=\frac1m\sum_{v\in\mathcal V}\exp\left\{\mu v\cdot y-\frac{\mu^2}{2}\right\}.
\end{align*}
Under $P$, write $Y\sim\mathcal N(0,I_d)$. Squaring the finite average gives
\begin{align*}
L(Y)^2=\frac1{m^2}\sum_{v,v'\in\mathcal V}\exp\left\{\mu(v+v')\cdot Y-\mu^2\right\}.
\end{align*}
Taking expectation term by term,
\begin{align*}
\mathbb E_P[L(Y)^2]=\frac1{m^2}\sum_{v,v'\in\mathcal V}\mathbb E_P\left[\exp\left\{\mu(v+v')\cdot Y-\mu^2\right\}\right].
\end{align*}
For $Y\sim\mathcal N(0,I_d)$ and fixed $t\in\mathbb R^d$, the Gaussian moment-generating identity gives
\begin{align*}
\mathbb E_P[\exp\{t\cdot Y\}]=\exp\left\{\frac{|t|^2}{2}\right\}.
\end{align*}
Using $t=\mu(v+v')$ therefore gives
\begin{align*}
\mathbb E_P\left[\exp\left\{\mu(v+v')\cdot Y-\mu^2\right\}\right]=\exp\left\{\frac{\mu^2|v+v'|^2}{2}-\mu^2\right\}.
\end{align*}
Because $|v|=|v'|=1$,
\begin{align*}
|v+v'|^2=|v|^2+2v\cdot v'+|v'|^2=2+2v\cdot v'.
\end{align*}
Thus
\begin{align*}
\frac{\mu^2|v+v'|^2}{2}-\mu^2=\frac{\mu^2(2+2v\cdot v')}{2}-\mu^2=\mu^2 v\cdot v'.
\end{align*}
Combining the preceding displays,
\begin{align*}
\mathbb E_P[L(Y)^2]=\frac1{m^2}\sum_{v,v'\in\mathcal V}\exp\{\mu^2 v\cdot v'\}.
\end{align*}
Equivalently, if $V$ and $V'$ are independent and uniform on $\mathcal V$, then
\begin{align*}
\mathbb E_P[L(Y)^2]=\mathbb E_{V,V'}[\exp\{\mu^2 V\cdot V'\}].
\end{align*}
So the exact second-moment condition is that this overlap average remains bounded. When this happens along a sequence of problems, *Second-Moment Contiguity Criterion* implies that the mixture alternative is contiguous to the null, so no test can reliably separate the null from this mixture in that signal regime.
[/example]
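The overlap formula is easy to evaluate for a specific prior. The sketch below computes $\frac1{m^2}\sum_{v,v'}\exp\{\mu^2 v\cdot v'\}$ directly and compares it with the closed form $1+(e^{\mu^2}-1)/d$ that the formula gives when $\mathcal V$ is the standard basis of $\mathbb R^d$, where $v\cdot v'$ is $1$ on the diagonal and $0$ elsewhere. The choice of $\mathcal V$ and the numerical values are assumptions made for illustration.

```python
import math


def dot(u, v) -> float:
    """Euclidean inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def second_moment(vectors, mu: float) -> float:
    """E_P[L^2] = (1 / m^2) * sum over pairs of exp(mu^2 <v, v'>)."""
    m = len(vectors)
    total = sum(math.exp(mu**2 * dot(v, w)) for v in vectors for w in vectors)
    return total / m**2


def standard_basis(d: int):
    """The d standard basis vectors of R^d as lists of floats."""
    return [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]


d, mu = 50, 1.5
numeric = second_moment(standard_basis(d), mu)
closed_form = 1 + (math.exp(mu**2) - 1) / d
print(numeric, closed_form)
```

For this prior the second moment stays below $2$ whenever $\mu^2\le\log d$, since then $(e^{\mu^2}-1)/d<1$.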
## Reading and Using the Rate Tables
A rate table is useful only if it records hypotheses. The final skill in the course is to read an entry as a compact theorem: model, parameter class, loss, estimator, lower-bound method, upper-bound method, and design or spectral assumptions.
[explanation: Checklist for a Rate Statement]
For each rate, identify the observation model and noise normalisation first. Then record the parameter class, including sparsity, bounded-spectrum, eigengap, or signal-strength assumptions. Next specify the loss: squared Euclidean, prediction, Frobenius, operator, support recovery, detection error, or sine-angle loss. Finally attach the lower-bound proof method and the upper-bound estimator.
This checklist prevents comparing unlike statements. A sparse regression prediction bound is not the same as a parameter estimation bound. A spectral PCA guarantee above the BBP threshold is not a statement below the threshold. A minimax lower bound for all estimators is not a computational lower bound unless a computational model has also been specified.
[/explanation]
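One way to make the checklist operational is to store each table entry as a record whose fields are the checklist items. The sketch below is a hypothetical data structure, not part of the course material: the class `RateStatement`, its field names, and the `comparable` rule (same model, parameter class, and loss) are all assumptions introduced here for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RateStatement:
    """One rate-table entry, with the hypotheses the checklist asks for."""
    model: str                  # observation model and noise normalisation
    parameter_class: str        # sparsity, bounded spectrum, eigengap, ...
    loss: str                   # squared Euclidean, prediction, operator, ...
    rate: str                   # leading-order scale, as a formula string
    lower_bound_method: str     # e.g. Fano packing, second moment
    upper_bound_estimator: str  # estimator achieving the rate
    assumptions: tuple = ()     # design or spectral assumptions


def comparable(a: RateStatement, b: RateStatement) -> bool:
    """Two rates are directly comparable only if model, class, and loss agree."""
    return (a.model, a.parameter_class, a.loss) == (
        b.model, b.parameter_class, b.loss)


prediction = RateStatement(
    model="Gaussian linear model", parameter_class="k-sparse",
    loss="prediction", rate="sigma^2 k log(ed/k) / n",
    lower_bound_method="Fano packing",
    upper_bound_estimator="sparse least squares",
)
parameter = RateStatement(
    model="Gaussian linear model", parameter_class="k-sparse",
    loss="squared Euclidean", rate="sigma^2 k log(ed/k) / n",
    lower_bound_method="Fano packing",
    upper_bound_estimator="sparse least squares",
    assumptions=("restricted eigenvalue on 2k-sparse vectors",),
)
print(comparable(prediction, parameter))
```

The two records carry the same rate string but different losses and assumptions, which is exactly the distinction between a prediction bound and a parameter bound that the checklist is meant to preserve.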
The course ends with a compact view of its recurring proof patterns. The statistical lower bounds come from testing reductions and information inequalities. The random-matrix upper bounds come from concentration, limiting spectra, and perturbation. The most informative results are those where these two sides meet at the same scale.
[remark: Course Map]
Sparse means and sparse regression are governed by packing entropy $k\log(ed/k)$. Covariance estimation is governed by matrix dimension and spectral fluctuation. PCA is governed by eigengap, ambient dimension, and possible sparsity. Compressed sensing is governed by the same sparse entropy that appears in minimax estimation, but translated into uniform geometric embedding by the measurement matrix.
[/remark]
These templates also connect the course to neighbouring subjects. The packing and entropy calculations parallel metric entropy in empirical process theory and covering arguments in functional analysis. Restricted isometry in compressed sensing is a probabilistic version of stable embedding from geometric functional analysis. Davis-Kahan and Weyl inequalities are perturbation results from numerical linear algebra, while contiguity and likelihood-ratio second moments are the same language used in asymptotic statistics and random graph detection.
## Connections and Further Reading
This course sits between [High-Dimensional Statistics I: Sparsity and Regularisation](/page/High-Dimensional%20Statistics%20I%3A%20Sparsity%20and%20Regularisation), probability theory, measure theory, linear algebra, and functional analysis. The minimax chapters use probability as the language of experiments and risk, while the random matrix chapters use linear algebra and operator norms to turn high-dimensional sampling noise into spectral statements.
Several companion directions are natural after this material. Random matrix theory develops the spectral-limit side of Marchenko-Pastur and BBP phenomena. Compressed sensing refines the restricted-isometry viewpoint for sparse recovery. Covariance estimation and principal component analysis are the statistical settings where operator bounds, eigengaps, and spike thresholds become concrete. Hypothesis testing and minimax theory provide the decision-theoretic language for the lower bounds.
For readers moving through Androma, useful next stops include [High-Dimensional Statistics I: Sparsity and Regularisation](/page/High-Dimensional%20Statistics%20I%3A%20Sparsity%20and%20Regularisation) for the estimator-side prequel, probability theory for concentration and weak convergence, measure theory for the foundations of likelihoods and expectations, linear algebra for eigenvalue and singular-value methods, and functional analysis for the operator-norm perspective that underlies many high-dimensional estimates.
## References
- Alexandre B. Tsybakov, *Introduction to Nonparametric Estimation*, Springer, 2009.
- Martin J. Wainwright, *High-Dimensional Statistics: A Non-Asymptotic Viewpoint*, Cambridge University Press, 2019.
- Roman Vershynin, *High-Dimensional Probability: An Introduction with Applications in Data Science*, Cambridge University Press, 2018.
- Zhidong Bai and Jack W. Silverstein, *Spectral Analysis of Large Dimensional Random Matrices*, Springer, 2010.
- Iain M. Johnstone, *High Dimensional Statistical Inference and Random Matrices*, proceedings article, 2006.
- Debashis Paul and Jack W. Silverstein, "No eigenvalues outside the support of the limiting empirical spectral distribution of a separable covariance matrix", *Journal of Multivariate Analysis*, 2009.
High-Dimensional Statistics II: Minimax Theory and Random Matrices
Created by admin on 6/7/2026 | Last updated on 6/7/2026