High-dimensional statistics studies what can be learned when the number of variables is comparable to, or much larger than, the number of observations. This course focuses on sparsity and regularisation as the two central ideas that make estimation possible in that regime. The main object throughout is the high-dimensional linear model, where only a small subset of predictors is assumed to matter, and the main goal is to recover or exploit that structure reliably in the presence of noise and strong collinearity.
The chapters develop this theory in a deliberate sequence. After introducing sparse linear models, the course builds the Lasso from multiple angles: its optimisation form, KKT conditions, and geometric interpretation. It then develops the probabilistic machinery needed for guarantees, including cone conditions and restricted eigenvalues, before proving oracle inequalities and studying variable selection and sign recovery. Later chapters broaden the toolkit to elastic net, adaptive Lasso, structured penalties, and the Dantzig selector, then move to nonconvex methods such as SCAD and MCP. The course concludes with debiased Lasso for inference, and with practical topics such as tuning, noise estimation, and diagnostics that connect the theory to applied work.
# Introduction
High-dimensional statistics studies inference when the ambient dimension $p$ is comparable to, or much larger than, the sample size $n$. In this regime, classical least squares is not enough: the data do not identify an unrestricted parameter vector, and successful estimation has to exploit structure. This course focuses on sparse structure and on regularisation methods that turn sparsity into finite-sample guarantees.
The main thread is the Lasso for linear regression, but the course is not only about an algorithm. We will track the statistical model, the geometry of the design matrix, the optimisation conditions, and the probability inequalities that control random scores. The guiding question is: under what assumptions can a high-dimensional estimator predict well, estimate a sparse vector accurately, and support valid inference?
## Why High Dimension Changes Regression
The first obstruction is already visible in the noiseless linear model. If $p>n$, the [linear map](/page/Linear%20Map) $X:\mathbb R^p \to \mathbb R^n$ has a non-trivial kernel, so many coefficient vectors give the same fitted values. A statistical method therefore needs a rule for selecting among indistinguishable or nearly indistinguishable explanations.
[example: Underdetermined Linear System]
Let $X \in \mathbb R^{n\times p}$ with $p>n$ and $\operatorname{rank}(X)=n$, and suppose $\beta_0\in\mathbb R^p$ satisfies $X\beta_0=Y$. Since $X$ has $p$ columns, the rank-nullity identity gives $\dim\ker(X)=p-\operatorname{rank}(X)$. Substituting $\operatorname{rank}(X)=n$ gives
\begin{align*}
\dim\ker(X)=p-n.
\end{align*}
Because $p>n$, we have $p-n>0$, so $\ker(X)$ contains some non-zero vector $h$. By definition of the kernel, this means
\begin{align*}
Xh=0.
\end{align*}
For any such $h$, linearity of matrix multiplication gives
\begin{align*}
X(\beta_0+h)=X\beta_0+Xh.
\end{align*}
Using $X\beta_0=Y$ and $Xh=0$, this becomes
\begin{align*}
X(\beta_0+h)=Y+0=Y.
\end{align*}
Thus $\beta_0+h$ is also a solution of $X\beta=Y$. If $h\ne 0$, then $\beta_0+h\ne\beta_0$, so the same response vector $Y$ is interpolated by two distinct coefficient vectors. Exact interpolation therefore identifies the fitted vector $X\beta$, but not the coefficient vector $\beta$; sparsity adds a rule for preferring solutions whose support is small.
[/example]
This non-identifiability is not only an algebraic nuisance. With noise, a method can fit random fluctuations unless the estimator is constrained or penalised. To compare procedures in this setting, we first need a loss that records whether the fitted mean vector is accurate even when the coefficient vector is not identifiable.
[definition: Prediction Loss]
Let $X \in \mathbb R^{n \times p}$ be a fixed design matrix. The prediction loss associated to $X$ is the map
\begin{align*}
L_X^{\mathrm{pred}} : \mathbb R^p \times \mathbb R^p &\to \mathbb R_{\ge 0}, &
(\hat\beta,\beta^*) &\mapsto \frac{1}{n}|X(\hat\beta-\beta^*)|^2.
\end{align*}
[/definition]
Prediction loss asks whether the fitted mean vector is close to the true mean vector on the observed design. It can be small even when the coefficient vector is not accurately recovered, so we also need a loss that measures the coefficient vector itself.
[definition: Estimation Loss]
For $q \in [1,\infty]$, the $\ell_q$ estimation loss is the map
\begin{align*}
L_q^{\mathrm{est}} : \mathbb R^p \times \mathbb R^p &\to \mathbb R_{\ge 0}, &
(\hat\beta,\beta^*) &\mapsto |\hat\beta-\beta^*|_q.
\end{align*}
[/definition]
Estimation loss is stronger because it measures the coefficient vector itself. Much of the course explains which geometric assumptions on $X$ allow prediction bounds to be converted into estimation bounds.
## Sparsity as a Statistical Assumption
The next question is what kind of structure is strong enough to compensate for $p\gg n$. The central assumption is that only a small subset of variables carries most of the signal. The language of supports lets us state this precisely.
[definition: Support]
The support map is
\begin{align*}
S : \mathbb R^p &\to \mathcal P(\{1,\dots,p\}), &
\beta &\mapsto \{j \in \{1,\dots,p\}: \beta_j \ne 0\}.
\end{align*}
[/definition]
The size of the support is the effective dimension in exact sparse models. A high-dimensional problem can behave like a low-dimensional one when the support is small and the design is favourable on sparse directions.
[definition: Exact Sparsity]
A vector $\beta^* \in \mathbb R^p$ is $s$-sparse if
\begin{align*}
|S(\beta^*)| \le s.
\end{align*}
[/definition]
Exact sparsity is the cleanest model for theory. Chapter 1 refines this into approximate sparsity, weak sparsity, and effective support, where many coefficients may be non-zero but most are small.
[example: Gene-Expression Regression]
In a gene-expression study, take $n=200$ patients and $p=20000$ measured genes. The design matrix has the form $X\in\mathbb R^{200\times 20000}$, where $X_{ij}$ is the expression level of gene $j$ for patient $i$, and a linear model writes the mean response for patient $i$ as
\begin{align*}
(X\beta^*)_i=\sum_{j=1}^{20000}X_{ij}\beta^*_j.
\end{align*}
If only $s=15$ genes affect the response, then the active set is
\begin{align*}
S(\beta^*)=\{j:\beta^*_j\ne 0\},
\end{align*}
and its size is
\begin{align*}
|S(\beta^*)|=15.
\end{align*}
For every inactive gene $j\notin S(\beta^*)$, the definition of the support gives $\beta^*_j=0$, so its contribution to the fitted mean for patient $i$ is
\begin{align*}
X_{ij}\beta^*_j=X_{ij}\cdot 0=0.
\end{align*}
Splitting the full sum into active and inactive coordinates gives
\begin{align*}
(X\beta^*)_i=\sum_{j\in S(\beta^*)}X_{ij}\beta^*_j+\sum_{j\notin S(\beta^*)}X_{ij}\beta^*_j.
\end{align*}
Substituting $X_{ij}\beta^*_j=0$ for every $j\notin S(\beta^*)$ gives
\begin{align*}
(X\beta^*)_i=\sum_{j\in S(\beta^*)}X_{ij}\beta^*_j+\sum_{j\notin S(\beta^*)}0.
\end{align*}
Since summing zeros contributes nothing,
\begin{align*}
(X\beta^*)_i=\sum_{j\in S(\beta^*)}X_{ij}\beta^*_j.
\end{align*}
The sparse model therefore says that the relevant dimension for prediction and variable screening is the active set size $s=15$, not the ambient number of measured genes $p=20000$.
[/example]
This example also warns that sparsity alone is not enough. If two active or inactive covariates are almost identical across the sample, the data may not distinguish which one should be selected. Geometry of the design matrix will therefore enter every major result.
## Regularisation and the Lasso
Once sparsity is the target structure, the statistical procedure needs a computationally tractable way to prefer sparse vectors. Directly minimising over all $s$-sparse vectors is combinatorial. The Lasso replaces this search by an $\ell_1$ penalty, which is convex and still favours sparse solutions.
[definition: Lasso Estimator]
Let $Y \in \mathbb R^n$, let $X \in \mathbb R^{n \times p}$, and let $\lambda>0$. A Lasso estimator is any solution
\begin{align*}
\hat\beta \in \operatorname*{argmin}_{\beta \in \mathbb R^p}
\left\{\frac{1}{2n}|Y-X\beta|^2 + \lambda |\beta|_1\right\}.
\end{align*}
[/definition]
The squared-error term asks for fit to the data, while the $\ell_1$ term charges for total coefficient magnitude. The tuning parameter $\lambda$ controls the tradeoff: larger values produce more shrinkage and usually fewer selected variables.
[example: Orthonormal Design and Soft Thresholding]
Suppose $p\le n$ and $X^\top X/n=I_p$, and write $z=X^\top Y/n$. For any $\beta\in\mathbb R^p$, the Lasso objective is
\begin{align*}
\frac{1}{2n}|Y-X\beta|^2+\lambda|\beta|_1=\frac{1}{2n}(Y-X\beta)^\top(Y-X\beta)+\lambda\sum_{j=1}^p|\beta_j|.
\end{align*}
Expanding the quadratic term gives
\begin{align*}
\frac{1}{2n}(Y-X\beta)^\top(Y-X\beta)=\frac{1}{2n}Y^\top Y-\frac{1}{n}Y^\top X\beta+\frac{1}{2n}\beta^\top X^\top X\beta.
\end{align*}
Since $z=X^\top Y/n$ and $X^\top X/n=I_p$, this becomes
\begin{align*}
\frac{1}{2n}|Y-X\beta|^2+\lambda|\beta|_1=\frac{1}{2n}|Y|^2-z^\top\beta+\frac{1}{2}\beta^\top\beta+\lambda\sum_{j=1}^p|\beta_j|.
\end{align*}
Writing the inner products coordinate by coordinate,
\begin{align*}
-z^\top\beta+\frac{1}{2}\beta^\top\beta+\lambda\sum_{j=1}^p|\beta_j|=\sum_{j=1}^p\left(\frac{1}{2}\beta_j^2-z_j\beta_j+\lambda|\beta_j|\right).
\end{align*}
The term $|Y|^2/(2n)$ does not depend on $\beta$, so minimising the full objective is equivalent to minimising, independently for each $j$,
\begin{align*}
f_j(b)=\frac{1}{2}b^2-z_jb+\lambda|b|.
\end{align*}
Fix one coordinate and write $z=z_j$. If $b>0$, then $|b|=b$, so
\begin{align*}
f_j(b)=\frac{1}{2}b^2-zb+\lambda b.
\end{align*}
Its derivative on this region is
\begin{align*}
f_j'(b)=b-z+\lambda.
\end{align*}
Thus the only critical point with $b>0$ is
\begin{align*}
b=z-\lambda,
\end{align*}
and this lies in the positive region exactly when $z>\lambda$.
If $b<0$, then $|b|=-b$, so
\begin{align*}
f_j(b)=\frac{1}{2}b^2-zb-\lambda b.
\end{align*}
Its derivative on this region is
\begin{align*}
f_j'(b)=b-z-\lambda.
\end{align*}
Thus the only critical point with $b<0$ is
\begin{align*}
b=z+\lambda,
\end{align*}
and this lies in the negative region exactly when $z<-\lambda$.
It remains to identify when $b=0$ is optimal. For $b\ge 0$,
\begin{align*}
f_j(b)-f_j(0)=\frac{1}{2}b^2-zb+\lambda b.
\end{align*}
Factoring the terms involving $b$ gives
\begin{align*}
f_j(b)-f_j(0)=\frac{1}{2}b^2+b(\lambda-z).
\end{align*}
This is non-negative for every $b\ge 0$ whenever $z\le\lambda$. For $b\le 0$, write $b=-u$ with $u\ge 0$. Then
\begin{align*}
f_j(-u)-f_j(0)=\frac{1}{2}u^2+zu+\lambda u.
\end{align*}
Factoring the terms involving $u$ gives
\begin{align*}
f_j(-u)-f_j(0)=\frac{1}{2}u^2+u(\lambda+z).
\end{align*}
This is non-negative for every $u\ge 0$ whenever $z\ge-\lambda$. Hence $b=0$ is optimal exactly when $-\lambda\le z\le\lambda$.
Combining the three regions, the coordinate minimiser is $z_j-\lambda$ when $z_j>\lambda$, is $0$ when $|z_j|\le\lambda$, and is $z_j+\lambda$ when $z_j<-\lambda$. Equivalently,
\begin{align*}
\hat\beta_j=\operatorname{sgn}(z_j)(|z_j|-\lambda)_+.
\end{align*}
Thus the $\ell_1$ penalty sets empirical correlations with $|z_j|\le\lambda$ exactly to zero, while every retained coordinate is moved toward zero by distance $\lambda$; this is the source of both sparsity and shrinkage bias in the orthonormal case.
[/example]
The orthonormal case gives the right mental model but hides the main difficulty. In general designs, coordinates interact through $X^\top X/n$, so we need a deterministic inequality that isolates the prediction error, the stochastic score, and the penalty term before any design assumptions are imposed.
[quotetheorem:5549]
[citeproof:5549]
The model assumption $Y=X\beta^*+\varepsilon$ is what identifies the random score term: without it, the residual cannot be decomposed into signal error plus noise. The fixed-design hypothesis keeps all randomness in $\varepsilon$, so the later concentration step only has to control $X^\top\varepsilon/n$ rather than a random matrix and random noise together. The Lasso optimality hypothesis is also essential; an arbitrary penalised-looking vector need not satisfy the comparison inequality against $\beta^*$.
This inequality separates the analysis into two tasks. Probability controls the stochastic score term $X^\top\varepsilon/n$, while geometry and sparsity control the penalty difference. By itself it does not imply consistency or sparsity recovery: if the score is large, or if the design collapses sparse directions, the right-hand side may not force the prediction error to be small.
## Probability Enters Through Score Concentration
The Lasso tuning parameter must dominate random correlations between the design columns and the noise. The key probabilistic object is the score vector $X^\top\varepsilon/n$. In high dimension, we need simultaneous control of all $p$ coordinates.
[definition: Normalised Fixed Design]
A fixed design matrix $X \in \mathbb R^{n\times p}$ is column-normalised if
\begin{align*}
\frac{1}{n}|X_j|^2 = 1
\end{align*}
for each column $X_j \in \mathbb R^n$, $j\in\{1,\dots,p\}$.
[/definition]
Column normalisation puts all covariates on the same scale, so a single penalty level can be compared across coordinates. With this scaling in place, the next result gives the probability event on which the stochastic term in the basic inequality is small enough for the penalty to control it.
[quotetheorem:5550]
[citeproof:5550]
The logarithmic dependence on $p$ is the first signature of high-dimensional probability in the course. Column normalisation is what makes the displayed threshold uniform across coordinates; without it, the bound would involve the column norms $|X_j|/n$, and columns on larger scales would need a larger penalty. Gaussianity gives the exact tail estimate used here, while sub-Gaussian noise gives the same order with different constants and heavier-tailed noise needs truncation, robustification, or stronger moment assumptions. The result controls only the score event: estimation still requires the deterministic Lasso inequality and a design condition preventing sparse errors from hiding in nearly-null directions.
[example: Choosing the Lasso Penalty]
Under the assumptions of Gaussian score concentration, set
\begin{align*}
\lambda = 2\sigma\sqrt{\frac{2(t+\log p)}{n}}.
\end{align*}
Dividing this identity by $2$ gives
\begin{align*}
\frac{\lambda}{2}=\sigma\sqrt{\frac{2(t+\log p)}{n}}.
\end{align*}
The Gaussian score concentration bound says
\begin{align*}
\mathbb P\left(\left|\frac{X^\top\varepsilon}{n}\right|_\infty>\sigma\sqrt{\frac{2(t+\log p)}{n}}\right)\le 2e^{-t}.
\end{align*}
Substituting $\lambda/2=\sigma\sqrt{2(t+\log p)/n}$ into the threshold gives
\begin{align*}
\mathbb P\left(\left|\frac{X^\top\varepsilon}{n}\right|_\infty>\frac{\lambda}{2}\right)\le 2e^{-t}.
\end{align*}
Therefore its complement satisfies
\begin{align*}
\mathbb P\left(\left|\frac{X^\top\varepsilon}{n}\right|_\infty\le\frac{\lambda}{2}\right)\ge 1-2e^{-t}.
\end{align*}
Thus this penalty level makes the stochastic score no larger than half the penalty with probability at least $1-2e^{-t}$. This is the event on which the [basic Lasso inequality](/theorems/5549) can be combined with sparsity to produce the cone constraints of Chapter 3 and the oracle inequalities of Chapter 4.
[/example]
## Geometry, Oracle Bounds, and Inference
After concentration controls the random part, the remaining question is deterministic: which directions can the design matrix see? Sparse estimators make errors that are concentrated near sparse cones, so the relevant lower bounds on $X$ only need to hold on those cones, not on all of $\mathbb R^p$.
[definition: Restricted Eigenvalue Condition]
Let $S\subset\{1,\dots,p\}$ and let $c_0>0$. A fixed design matrix $X\in\mathbb R^{n\times p}$ satisfies a restricted eigenvalue condition on $(S,c_0)$ with constant $\kappa>0$ if
\begin{align*}
\frac{|X\Delta|}{\sqrt n} \ge \kappa |\Delta_S|_2
\end{align*}
for every $\Delta\in\mathbb R^p$ such that $|\Delta_{S^c}|_1\le c_0|\Delta_S|_1$.
[/definition]
The restricted eigenvalue condition is weaker than requiring $X^\top X/n$ to be positive definite on all directions, which cannot happen when $p>n$. It asks only that the design be non-degenerate on the error directions produced by sparse regularisation. Together with the score event, it yields the basic oracle inequality that motivates the detailed proofs in the next chapters.
[quotetheorem:5551]
[citeproof:5551]
Each hypothesis rules out a different failure mode. If the score event fails, the noise correlations can overwhelm the penalty and the basic inequality no longer forces the error into a useful cone. If exact sparsity is replaced by a dense vector with many moderate coefficients, the effective dimension is no longer $s$ and an additional approximation term is needed. If the restricted eigenvalue condition fails, there may be a non-zero sparse-cone direction $\Delta$ with $X\Delta$ very small, so prediction can be accurate while coefficient estimation remains unstable.
This prototype theorem is the template for much of the course, but it is deliberately not the final word. It gives rates for prediction and $\ell_1$ estimation, not exact support recovery, sign consistency, valid confidence intervals, or sharp constants. Chapters 3 and 4 refine the assumptions for oracle bounds, Chapter 5 asks when support recovery is possible, Chapters 6 through 8 compare alternative penalties, and Chapter 9 develops confidence intervals.
[remark: From Estimation to Inference]
Regularisation introduces bias, so good prediction and estimation bounds do not automatically yield classical confidence intervals. Debiased and desparsified methods add a correction term designed to remove first-order shrinkage bias. Their analysis combines the same score concentration and restricted geometry with additional approximations to inverse covariance structure.
[/remark]
The course ends by returning to the opening question in a sharper form. Sparse regularisation succeeds when the signal is structured, the noise is controlled uniformly over many coordinates, and the design matrix preserves enough information on the sparse error cone. It fails, or needs modification, when one of these three ingredients breaks down.
With the overall roadmap in place, the course now fixes the main statistical object: a linear signal observed through noise in a regime where the ambient dimension may exceed the sample size. The next chapter makes that setting precise and introduces the sparse structure that allows estimation to remain possible despite the lack of classical identifiability.
# 1. High-Dimensional Linear Models and Sparse Structure
Building on the introductory roadmap, this chapter fixes the statistical object studied throughout the course: a linear signal observed through noise in a regime where the ambient dimension may be much larger than the sample size. Classical least squares relies on having enough samples to identify all coordinates of the coefficient vector, while high-dimensional statistics replaces full-dimensional identifiability by structural assumptions such as sparsity. The aim is to separate the modelling assumptions, the loss functions, and the probabilistic concentration estimates that later make regularised estimators analyzable.
## Linear Models in High Dimension
The first question is what remains meaningful about linear regression when there are more covariates than observations. The equation $Y = X\beta^* + \varepsilon$ is familiar from classical regression, but in the high-dimensional setting its interpretation must be made precise before estimation can be discussed. We start by naming the objects in the model so that later assumptions can be attached to the design, the noise, or the target vector separately.
[definition: High-Dimensional Linear Model]
Let $n,p \in \mathbb N$. A high-dimensional linear model consists of a response vector $Y \in \mathbb R^n$, a design matrix $X \in \mathbb R^{n \times p}$, an unknown coefficient vector $\beta^* \in \mathbb R^p$, and a noise vector $\varepsilon \in \mathbb R^n$ satisfying
\begin{align*}
Y = X\beta^* + \varepsilon.
\end{align*}
[/definition]
The adjective high-dimensional refers to the scaling of $p$ relative to $n$, not to a different algebraic model. The next modelling choice is whether the design matrix is treated as part of the experimental environment or as another random object. This distinction is needed because concentration may be required only for the noise, or for both the noise and the empirical geometry of the covariates.
[definition: Fixed Design Linear Model]
In the fixed design linear model, $X \in \mathbb R^{n \times p}$ is treated as deterministic, while $\varepsilon$ is random and $Y$ is random through the equation $Y=X\beta^*+\varepsilon$.
[/definition]
Fixed design is natural when the covariates are chosen by an experimenter, or when the analysis is conditional on observed covariates. Many data sets instead arrive as independent sampled pairs, so the design itself has a distribution. This motivates the random design formulation, where the course later proves that suitable sample matrices inherit good geometry from the population.
[definition: Random Design Linear Model]
In the random design linear model, the rows $x_1,\dots,x_n \in \mathbb R^p$ of $X$ are random vectors, the responses satisfy
\begin{align*}
Y_i = x_i \cdot \beta^* + \varepsilon_i, \qquad 1 \le i \le n,
\end{align*}
and probability statements are made over the joint distribution of the design and the noise.
[/definition]
Random design assumptions add an additional layer of concentration, while fixed-design results isolate the deterministic geometry required for estimation. Once the source of randomness is settled, the next problem is to decide what kind of error should be measured. In high dimension, predicting the fitted values and estimating the coefficient vector are different goals.
[definition: Prediction Loss]
For fixed $X \in \mathbb R^{n \times p}$ and $\beta^* \in \mathbb R^p$, the empirical prediction loss is the map $L_{\mathrm{pred}}: \mathbb R^p \to \mathbb R_+$ defined by
\begin{align*}
L_{\mathrm{pred}}(\hat{\beta}) = \frac{1}{n}|X(\hat{\beta}-\beta^*)|^2.
\end{align*}
[/definition]
Prediction loss is the most immediate quantity controlled by least-squares type objectives. It may be small even when $\hat{\beta}$ is far from $\beta^*$, because the null space of $X$ can hide coefficient error. To discuss recovery of the unknown parameter itself, we also need losses measured directly in coefficient space.
[definition: Estimation Loss]
For fixed $\beta^* \in \mathbb R^p$ and $q \in [1,\infty]$, the $\ell_q$ estimation loss is the map $L_q: \mathbb R^p \to \mathbb R_+$ defined by
\begin{align*}
L_q(\hat{\beta}) = \|\hat{\beta}-\beta^*\|_q.
\end{align*}
[/definition]
Estimation loss is stronger than prediction loss in ill-conditioned or underdetermined problems. The oracle inequalities in Chapters 3 and 4 will first control prediction loss and then convert it into coefficient error using restricted eigenvalue conditions. Before those geometric conditions appear, the orthogonal case gives a warm-up in which prediction and estimation are aligned.
[example: Orthogonal Design Warm-Up]
Suppose $p \le n$, $X \in \mathbb R^{n\times p}$ has orthogonal columns, and $X^\top X/n=I_p$. We compute the coefficient error of the least-squares estimator $\hat{\beta}^{\mathrm{ols}}$ from the normal equations
\begin{align*}
X^\top(Y-X\hat{\beta}^{\mathrm{ols}})=0.
\end{align*}
Using the model equation $Y=X\beta^*+\varepsilon$, the left-hand side is
\begin{align*}
X^\top(Y-X\hat{\beta}^{\mathrm{ols}})=X^\top(X\beta^*+\varepsilon-X\hat{\beta}^{\mathrm{ols}}).
\end{align*}
By distributivity of matrix multiplication,
\begin{align*}
X^\top(X\beta^*+\varepsilon-X\hat{\beta}^{\mathrm{ols}})=X^\top X\beta^*+X^\top\varepsilon-X^\top X\hat{\beta}^{\mathrm{ols}}.
\end{align*}
Combining the two terms involving $X^\top X$ gives
\begin{align*}
X^\top X\beta^*+X^\top\varepsilon-X^\top X\hat{\beta}^{\mathrm{ols}}=X^\top\varepsilon-X^\top X(\hat{\beta}^{\mathrm{ols}}-\beta^*).
\end{align*}
Therefore the normal equations imply
\begin{align*}
X^\top X(\hat{\beta}^{\mathrm{ols}}-\beta^*)=X^\top\varepsilon.
\end{align*}
Since $X^\top X/n=I_p$, we have $X^\top X=nI_p$ and hence $(X^\top X)^{-1}=(1/n)I_p$. Multiplying the preceding identity by $(X^\top X)^{-1}$ gives
\begin{align*}
\hat{\beta}^{\mathrm{ols}}-\beta^*=(X^\top X)^{-1}X^\top\varepsilon.
\end{align*}
Substituting $(X^\top X)^{-1}=(1/n)I_p$ yields
\begin{align*}
\hat{\beta}^{\mathrm{ols}}-\beta^*=\frac{1}{n}X^\top\varepsilon.
\end{align*}
Thus the noise score $X^\top\varepsilon/n$ is exactly the coefficient error in the orthogonal case. Its $j$th coordinate is $n^{-1}X_{\cdot j}^\top\varepsilon$, the empirical correlation between the noise and the $j$th design column, so controlling this score directly controls coefficient error here; in non-orthogonal high-dimensional designs, the same score is the stochastic term that appears in the optimality conditions for regularised estimators.
[/example]
The orthogonal case is a guide rather than the typical case. When $p>n$, exact orthogonality of all $p$ columns in $\mathbb R^n$ is impossible, and many coefficient vectors can share the same fitted values. The next section introduces sparsity as the structural assumption that reduces the effective dimension of the problem.
## Sparse Structure and Effective Dimension
The next question is what structural assumption can replace the classical condition $p<n$. Sparsity says that the signal is not genuinely $p$-dimensional, even though it is represented in $\mathbb R^p$. To state sparsity, we first need a precise way to record which coordinates of a coefficient vector are active.
[definition: Support]
The support map $\operatorname{supp}: \mathbb R^p \to \mathcal P(\{1,\dots,p\})$ is defined by
\begin{align*}
\operatorname{supp}(\beta) = \{j \in \{1,\dots,p\}: \beta_j \ne 0\}.
\end{align*}
[/definition]
The support records which covariates are active in the linear signal. Its cardinality is the most basic measure of sparsity, because it counts the number of coordinates that must be learned. This leads to the exact sparsity model used in the first oracle inequalities.
[definition: Exact Sparsity]
A vector $\beta^* \in \mathbb R^p$ is $s$-sparse if
\begin{align*}
|\operatorname{supp}(\beta^*)| \le s.
\end{align*}
[/definition]
Exact sparsity is mathematically clean and treats all inactive coordinates as zero. It is useful for first proofs because the effective number of parameters is $s$ rather than $p$. The following example illustrates why this idealisation is plausible in applications, even when the ambient list of covariates is enormous.
[example: Gene-Expression Regression]
In a gene-expression study, let $Y_i$ be the treatment response of patient $i$, and let $X_{ij}$ be the measured expression level of transcript $j$ for that patient. The linear model writes each response as
\begin{align*}
Y_i = \sum_{j=1}^p X_{ij}\beta^*_j + \varepsilon_i, \qquad 1\le i\le n.
\end{align*}
If the active genes are collected in a set $S\subseteq \{1,\dots,p\}$ with $|S|=s$, exact sparsity means
\begin{align*}
\beta^*_j=0 \quad \text{for every } j\notin S.
\end{align*}
For each patient $i$, the linear predictor can then be split into active and inactive coordinates:
\begin{align*}
\sum_{j=1}^p X_{ij}\beta^*_j = \sum_{j\in S} X_{ij}\beta^*_j + \sum_{j\notin S} X_{ij}\beta^*_j.
\end{align*}
Using $\beta^*_j=0$ for $j\notin S$, the inactive part is
\begin{align*}
\sum_{j\notin S} X_{ij}\beta^*_j = \sum_{j\notin S} X_{ij}\cdot 0 = 0.
\end{align*}
Therefore
\begin{align*}
\sum_{j=1}^p X_{ij}\beta^*_j = \sum_{j\in S} X_{ij}\beta^*_j.
\end{align*}
Thus the ambient number of measured transcripts may be in the thousands or tens of thousands, while the model treats only the $s$ coordinates in $S$ as contributing to the signal; the inactive genes remain measured covariates, but their coefficients do not enter the linear predictor.
[/example]
The gene-expression example also points to a limitation: real signals often have many small coefficients rather than many exact zeros. A useful theory should compare the true vector with its best sparse surrogate and charge a separate approximation error for the discarded coordinates. This motivates the notion of best sparse approximation.
[definition: Best Sparse Approximation]
For $\beta \in \mathbb R^p$ and $s \in \{1,\dots,p\}$, a best $s$-sparse approximation to $\beta$ in $\ell_1$ is any vector $\beta^{(s)} \in \mathbb R^p$ satisfying $|\operatorname{supp}(\beta^{(s)})| \le s$ and
\begin{align*}
\|\beta - \beta^{(s)}\|_1 = \inf_{|S|\le s}\sum_{j \notin S}|\beta_j|.
\end{align*}
[/definition]
This definition formalises the operation of retaining the largest coordinates and treating the rest as approximation error. Once the best sparse surrogate is available, the next question is how small the remaining tail must be. Approximate sparsity records this tail size as part of the model class.
[definition: Approximate Sparsity]
A vector $\beta^* \in \mathbb R^p$ is approximately sparse at level $s$ with tail radius $r_s \ge 0$ if there exists an $s$-sparse vector $\beta^{(s)} \in \mathbb R^p$ such that
\begin{align*}
\|\beta^* - \beta^{(s)}\|_1 \le r_s.
\end{align*}
[/definition]
Approximate sparsity lets the theory separate estimation of the dominant coordinates from the bias caused by the small remaining coordinates. Another common way to express compressibility is not to choose $s$ in advance, but to impose a decay condition on all coordinates. This leads to weak sparsity classes.
[definition: Weak Sparsity]
For $0<q<1$ and $R>0$, a vector $\beta^* \in \mathbb R^p$ belongs to the weak sparsity class $\mathcal B_q(R)$ if
\begin{align*}
\sum_{j=1}^p |\beta^*_j|^q \le R.
\end{align*}
[/definition]
Weak sparsity is not a norm condition when $q<1$, but it is useful because it forces the ordered coefficients to decay. In statistical estimation, coefficients below the noise scale cannot be reliably separated from zero. We therefore introduce an effective support at a chosen resolution threshold.
[definition: Effective Support]
For a threshold $\tau>0$, the effective support of $\beta^* \in \mathbb R^p$ at level $\tau$ is
\begin{align*}
S_\tau(\beta^*) = \{j \in \{1,\dots,p\}: |\beta^*_j| > \tau\}.
\end{align*}
[/definition]
The effective support depends on the noise level and sample size, not only on the signal. The next result shows that weak sparsity bounds the number of coordinates visible above any threshold. This is the first example of converting a high-dimensional vector assumption into an effective dimension bound.
[quotetheorem:5552]
[citeproof:5552]
This elementary bound is the prototype for many effective-dimension arguments. It converts a continuous decay condition into a finite support size at the resolution level $\tau$. The hypothesis $\beta^* \in \mathcal B_q(R)$ is essential: if $\beta^*_j=1$ for all $1\le j\le p$, then $|S_{1/2}(\beta^*)|=p$, so no bound independent of the ambient dimension can hold without a global size condition. The theorem also does not identify the active variables or assert that the visible coordinates are recoverable from data; it only gives a deterministic counting bound once the coefficient vector is known to satisfy the quasi-ball condition. Polynomial decay gives a concrete example where the effective support can be read off from the coefficient sequence.
[example: Polynomially Decaying Coefficients]
Let $\beta^*_j=Cj^{-a}$ for $1\le j\le p$, where $C>0$ and $a>1/q$. Since $C>0$ and $j^{-a}>0$, we have
\begin{align*}
|\beta^*_j|^q
= |Cj^{-a}|^q
= C^q(j^{-a})^q
= C^q j^{-aq}.
\end{align*}
Therefore
\begin{align*}
\sum_{j=1}^p |\beta^*_j|^q
= \sum_{j=1}^p C^q j^{-aq}
= C^q\sum_{j=1}^p j^{-aq}
\le C^q\sum_{j=1}^\infty j^{-aq}.
\end{align*}
Because $a>1/q$, multiplying by $q>0$ gives $aq>1$, so the $p$-series $\sum_{j=1}^\infty j^{-aq}$ converges. Setting
\begin{align*}
R = C^q\sum_{j=1}^\infty j^{-aq}
\end{align*}
gives $\sum_{j=1}^p |\beta^*_j|^q \le R$, hence $\beta^*\in \mathcal B_q(R)$.
For a threshold $\tau>0$, the effective support is
\begin{align*}
S_\tau(\beta^*)
= \{j\in\{1,\dots,p\}: |\beta^*_j|>\tau\}
= \{j\in\{1,\dots,p\}: Cj^{-a}>\tau\}.
\end{align*}
Since $Cj^{-a}>\tau$ is equivalent to $j^{-a}>\tau/C$, and all quantities are positive, this is equivalent to
\begin{align*}
j^a < \frac{C}{\tau},
\end{align*}
or, after taking the positive $a$th root,
\begin{align*}
j < \left(\frac{C}{\tau}\right)^{1/a}.
\end{align*}
Thus $S_\tau(\beta^*)$ consists exactly of the indices $1\le j\le p$ lying below $(C/\tau)^{1/a}$, so
\begin{align*}
|S_\tau(\beta^*)|
= \#\left\{j\in\{1,\dots,p\}: j<\left(\frac{C}{\tau}\right)^{1/a}\right\}
\le \left(\frac{C}{\tau}\right)^{1/a}.
\end{align*}
This makes the decay interpretation explicit: faster polynomial decay, meaning larger $a$, leaves fewer coefficients visible above a fixed threshold $\tau$.
[/example]
The language of exact, approximate, and weak sparsity gives several levels of idealisation. Exact sparsity is best for first proofs, while approximate and weak sparsity are closer to applications where small coefficients form a long tail. The final ingredient for this chapter is the probabilistic scale at which noise masks small coefficients.
## Scaling, Normalisation, and Noise
The final preparatory question is why logarithms of $p$ appear throughout high-dimensional rates. Even before choosing an estimator, we must control the largest empirical correlation between the design columns and the noise. To make the coordinates comparable, the design columns must first be placed on a common scale.
[definition: Column Normalisation]
A fixed design matrix $X \in \mathbb R^{n\times p}$ is column-normalised if
\begin{align*}
\frac{1}{n}\sum_{i=1}^n X_{ij}^2 = 1
\end{align*}
for every $j \in \{1,\dots,p\}$.
[/definition]
Column normalisation puts all coordinates on the same scale. Without it, a single tuning parameter for an $\ell_1$ penalty would shrink different covariates by incomparable amounts. After fixing the scale of the design, the next assumption controls the tails of the noise variables.
A mean-zero real-valued [random variable](/page/Random%20Variable) $Z$ is called sub-Gaussian with variance proxy $\sigma^2>0$ if
\begin{align*}
\mathbb E[e^{tZ}] \le \exp\left(\frac{\sigma^2t^2}{2}\right)
\end{align*}
for every $t\in \mathbb R$. Sub-Gaussian noise is the standard light-tailed assumption for the first part of the course. The noise score is the vector $S=X^\top\varepsilon/n \in \mathbb R^p$, whose $j$th coordinate is the empirical correlation between the $j$th design column and the noise. Regularisation succeeds only if the tuning parameter is large enough to dominate this random vector in $\ell_\infty$, so we need a concentration theorem for $S$.
[quotetheorem:5550]
[citeproof:5550]
This theorem explains the basic high-dimensional noise level. Up to constants, the largest noise correlation has size $\sigma\sqrt{\log p/n}$; the logarithm appears because all $p$ coordinates must be controlled simultaneously. Each assumption has a role. Fixed design lets us condition on $X$ and compute the variance of each score coordinate directly; in random design, an additional event controlling the column norms is needed. Column normalisation is essential for this stated scale: if one column is multiplied by $M$, then the corresponding score coordinate is multiplied by $M$, so no common bound of order $\sigma\sqrt{\log p/n}$ can hold uniformly. Independence and Gaussian tails are also doing work; perfectly dependent or heavy-tailed noise can make a union-bound Gaussian tail estimate false. Finally, this result controls only the stochastic score, not coefficient recovery: if $X$ has two identical columns, prediction may be controlled while the two coefficients remain non-identifiable. The informal notation used in lectures records this scale while suppressing the chosen probability level.
[remark: Meaning of the Symbol Lesssim]
In this chapter, the informal bound
\begin{align*}
\left\|\frac{X^\top\varepsilon}{n}\right\|_\infty \lesssim \sigma\sqrt{\frac{\log p}{n}}
\end{align*}
means that the inequality holds up to a universal numerical constant, with a high-probability parameter suppressed. Later results state the probability level explicitly when the tuning parameter is chosen.
[/remark]
The theorem is stated for Gaussian noise because the proof is short and reveals the mechanism. Under independent sub-Gaussian noise with common variance proxy $\sigma^2$, the same union-bound argument applies after replacing the Gaussian tail inequality by the sub-Gaussian tail inequality. The resulting scale is the starting point for choosing penalties in sparse regression.
[example: Choosing the Regularisation Scale]
In a fixed-design problem with $p=10{,}000$, $n=200$, and noise standard deviation $\sigma$, the score bound predicts the stochastic scale
\begin{align*}
\sigma\sqrt{\frac{\log p}{n}}.
\end{align*}
Substituting $p=10{,}000$ and $n=200$ gives
\begin{align*}
\sigma\sqrt{\frac{\log p}{n}} = \sigma\sqrt{\frac{\log(10{,}000)}{200}}.
\end{align*}
Since $10{,}000=10^4$, the logarithm identity $\log(a^b)=b\log(a)$ gives
\begin{align*}
\log(10{,}000)=\log(10^4)=4\log(10).
\end{align*}
Using $\log(10)\approx 2.302585$, this becomes
\begin{align*}
4\log(10)\approx 4\cdot 2.302585=9.21034.
\end{align*}
Dividing by $200$ gives
\begin{align*}
\frac{\log(10{,}000)}{200}\approx \frac{9.21034}{200}=0.0460517.
\end{align*}
Therefore
\begin{align*}
\sigma\sqrt{\frac{\log(10{,}000)}{200}}\approx \sigma\sqrt{0.0460517}\approx 0.2146\,\sigma.
\end{align*}
Thus, even with $10{,}000$ covariates, the stochastic score scale is about $21\%$ of the noise standard deviation in this calculation; the cost of controlling many noise correlations grows through $\log p$, not through $p$ itself, provided the later sparsity and design-geometry assumptions are available.
[/example]
The chapter has now isolated the ingredients that the Lasso analysis in Chapter 2 will combine. Sparsity reduces the effective number of unknowns, column normalisation makes a single penalty level meaningful, and score concentration identifies the scale at which the penalty should dominate noise.
The model and the score vector together show why the Lasso is the natural estimator to study next: it converts a noisy, high-dimensional regression problem into an optimisation problem with an explicit sparsity-inducing penalty. The following chapter develops its formulation, optimality conditions, and geometry so the later risk analysis has a deterministic foundation.
# 2. The Lasso: Formulations, KKT Conditions, and Geometry
The first chapter set up the high-dimensional linear model and explained why the score vector
\begin{align*}
\frac{1}{n}X^\top \varepsilon
\end{align*}
governs the noise level seen by many covariates at once. This chapter introduces the Lasso as the basic estimator that turns that observation into a computable procedure. The main prerequisites are convex optimisation in finite dimensions, subgradients of convex functions, and the basic geometry of the $\ell_1$ and $\ell_\infty$ norms. The main themes are the equivalence between penalised and constrained formulations, the KKT equations that describe the solution, and the geometry of shrinkage in the orthonormal and correlated-design cases.
Throughout this chapter, as in the fixed-design setup of Chapter 1, take a fixed design matrix $X \in \mathbb R^{n \times p}$ and response vector $Y \in \mathbb R^n$. Unless stated otherwise, the empirical squared loss is the functional $L_n:\mathbb R^p \to \mathbb R$ defined by
\begin{align*}
L_n:\beta \mapsto \frac{1}{2n}|Y-X\beta|^2.
\end{align*}
The factor $1/(2n)$ is chosen so that gradients contain the residual-correlation vector
\begin{align*}
\frac{1}{n}X^\top(Y-X\beta)
\end{align*}
rather than extra constants.
## Penalised and Constrained Formulations
The high-dimensional difficulty is not only that $p$ may be larger than $n$, but that many vectors can fit the data comparably well. The Lasso selects among these vectors by combining fit with an $\ell_1$ preference for sparse coefficient vectors. There are two common ways to encode this preference: add a penalty to the loss, or restrict the feasible set to an $\ell_1$ ball.
[definition: Penalised Lasso]
For $\lambda \ge 0$, a penalised Lasso estimator is any minimiser
\begin{align*}
\hat\beta_\lambda \in \operatorname*{argmin}_{\beta \in \mathbb R^p} Q_\lambda(\beta),
\end{align*}
where $Q_\lambda:\mathbb R^p \to \mathbb R$ is the objective
\begin{align*}
Q_\lambda:\beta \mapsto \frac{1}{2n}|Y-X\beta|^2 + \lambda \|\beta\|_1.
\end{align*}
[/definition]
The penalty parameter $\lambda$ measures how much residual error we are willing to trade for a smaller $\ell_1$ norm. Large values of $\lambda$ force stronger shrinkage, while $\lambda=0$ reduces the problem to least squares, which may have many solutions when $p>n$.
[definition: Constrained Lasso]
For $t \ge 0$, a constrained Lasso estimator is any minimiser
\begin{align*}
\hat\beta_t \in \operatorname*{argmin}_{\beta \in \mathbb R^p:\,\|\beta\|_1 \le t} L_n(\beta).
\end{align*}
[/definition]
The constrained form makes the geometry visible: the least-squares loss is minimised over a polytope with corners on sparse coordinate axes. In two dimensions the feasible region is a diamond, and a smooth elliptical loss contour often first touches the diamond at a coordinate-axis corner; in higher dimensions the same mechanism occurs on faces of the $\ell_1$ ball that set many coordinates to zero. Since the two formulations use different tuning languages, the next result explains why they describe the same family of optimisation problems once the appropriate Lagrange multiplier is chosen. This equivalence lets us move between geometric intuition about an $\ell_1$ ball and statistical calibration through a penalty level.
[quotetheorem:5553]
[citeproof:5553]
This equivalence should not be read as a one-to-one matching between $\lambda$ and $t$ in all designs. The multiplier hypothesis in the converse matters: a constrained minimiser at radius $t$ need not correspond to an arbitrary penalty level, but only to a multiplier that satisfies stationarity and [complementary slackness](/theorems/2559) at that particular optimum. Complementary slackness is what prevents an inactive constraint from creating a spurious positive penalty. For a concrete failure, take $p=1$, $X^\top X/n=1$, and $X^\top Y/n=1$. At radius $t=2$, the constrained solution is $\hat\beta_t=1$ and the constraint is inactive. It solves the penalised problem only for $\lambda=0$; choosing any $\lambda>0$ would give the soft-thresholded value $1-\lambda$ when $0<\lambda<1$, or $0$ when $\lambda\ge 1$. Thus the condition $\lambda(\|\hat\beta_t\|_1-t)=0$ is not decorative: without it the converse statement is false.
Stationarity is equally necessary. In the same one-dimensional design at radius $t=0$, the only feasible point is $\hat\beta_t=0$, and it is constrained-optimal. It corresponds to penalised Lasso solutions only for multipliers $\lambda \ge 1$, because the KKT condition requires $1 \in \lambda[-1,1]$. If one chose $\lambda=1/2$, feasibility and complementary slackness would hold at $\hat\beta=0$, but stationarity would fail and the penalised minimiser would be $1/2$. Non-uniqueness can also create flat regions in the correspondence; for instance, if two columns of $X$ are identical, redistributing mass between their coefficients can leave the fitted value and the loss unchanged while changing which coefficient vector represents the same fit. At the extreme, when $\lambda=0$, every least-squares solution is penalised-optimal, but these solutions can have different $\ell_1$ norms. For statistical analysis, the penalised version remains the main object because $\lambda$ can be chosen above the random threshold $\|X^\top\varepsilon/n\|_\infty$, while the constrained radius is usually harder to calibrate directly from the noise.
[example: Penalised Radius Along a Path]
Suppose $X^\top X/n=I_p$, write $z=X^\top Y/n$, and fix $\lambda \ge 0$. By *Soft-Thresholding Formula for Orthonormal Design*, the penalised solution has coordinates
\begin{align*}
\hat\beta_j=S_\lambda(z_j)=\operatorname{sgn}(z_j)(|z_j|-\lambda)_+.
\end{align*}
We compute the $\ell_1$ radius of this solution. If $|z_j|>\lambda$, then $(|z_j|-\lambda)_+=|z_j|-\lambda$, and therefore
\begin{align*}
|\hat\beta_j|=\left|\operatorname{sgn}(z_j)(|z_j|-\lambda)\right|.
\end{align*}
Since $|z_j|>\lambda$ implies $z_j\ne 0$, we have $|\operatorname{sgn}(z_j)|=1$, so
\begin{align*}
|\hat\beta_j|=|z_j|-\lambda=(|z_j|-\lambda)_+.
\end{align*}
If $|z_j|\le \lambda$, then $(|z_j|-\lambda)_+=0$, hence
\begin{align*}
\hat\beta_j=\operatorname{sgn}(z_j)\cdot 0=0.
\end{align*}
Thus in this case also
\begin{align*}
|\hat\beta_j|=0=(|z_j|-\lambda)_+.
\end{align*}
Combining the two cases for every coordinate gives
\begin{align*}
\|\hat\beta_\lambda\|_1=\sum_{j=1}^p |\hat\beta_j|.
\end{align*}
Substituting the coordinate identity $|\hat\beta_j|=(|z_j|-\lambda)_+$ yields
\begin{align*}
\|\hat\beta_\lambda\|_1=\sum_{j=1}^p (|z_j|-\lambda)_+.
\end{align*}
Thus the constrained radius along the penalised path is
\begin{align*}
t(\lambda)=\sum_{j=1}^p (|z_j|-\lambda)_+.
\end{align*}
For a fixed coordinate $j$, the summand is $0$ when $\lambda\ge |z_j|$ and is $|z_j|-\lambda$ when $0\le\lambda<|z_j|$. Therefore, as $\lambda$ decreases, coordinate $j$ begins contributing to the radius exactly when $\lambda$ passes below its absolute empirical correlation $|z_j|$; in this orthonormal design, the penalty level and the constraint radius describe the same solution path.
[/example]
## KKT Conditions and Residual Correlations
To analyse the Lasso we need a local certificate for optimality that can be checked coordinate by coordinate. The loss is differentiable, but the $\ell_1$ norm has corners at zero. The right language is therefore subgradients, which record all supporting hyperplanes of a convex function at a point.
[definition: Subgradient]
Let $f:\mathbb R^p \to \mathbb R$ be convex. A vector $g \in \mathbb R^p$ is a subgradient of $f$ at $\beta \in \mathbb R^p$ if
\begin{align*}
f(\gamma) \ge f(\beta) + g^\top(\gamma-\beta) \quad \text{for all } \gamma \in \mathbb R^p.
\end{align*}
The set of all subgradients is denoted $\partial f(\beta)$.
[/definition]
Subgradients replace gradients at the corners of the penalty, so the next task is to compute them for the particular penalty used by the Lasso. For the $\ell_1$ norm, this computation is coordinatewise: nonzero coordinates have fixed signs, while zero coordinates allow a full interval of possible certificates.
[quotetheorem:5554]
[citeproof:5554]
The subgradient formula distinguishes coordinates with fixed signs from coordinates that are allowed to stay at zero. Separability is essential here: the $\ell_1$ norm decomposes into one-dimensional absolute values, so the subdifferential is a Cartesian product of coordinatewise conditions. If the penalty is instead the nonseparable Euclidean norm $g(\beta)=|\beta|$ on $\mathbb R^2$, then at $\beta=(0,0)$ the subdifferential is the disk $\{z\in\mathbb R^2: |z|\le 1\}$, not the square $[-1,1]^2$. The vector $(1,1)$ satisfies the coordinatewise bounds $z_1,z_2\in[-1,1]$ but is not a subgradient of $g$ at zero, since $|(1,1)|>1$. This counterexample shows why coordinatewise thresholds are a special feature of the $\ell_1$ penalty rather than a generic property of convex penalties.
The boundary case $\beta_j=0$ is the decisive one for sparsity, because the interval $[-1,1]$ permits an entire range of residual correlations to be certified without moving the coefficient away from zero. Mishandling this case destroys the Lasso optimality test. In the one-dimensional objective $\frac12(\beta-z)^2+\lambda|\beta|$, take $z=\lambda/2$. The minimiser is $\hat\beta=0$, because the KKT condition $z\in\lambda[-1,1]$ holds. If one incorrectly used only the derivative values $\{-1,1\}$ at zero, no certificate would exist, and the correct sparse solution would be rejected. To state the KKT equations in a way that separates the sign and interval roles, we name the set of nonzero fitted coefficients.
[definition: Active Set]
For a vector $\hat\beta \in \mathbb R^p$, its active set is
\begin{align*}
\hat S = \{j \in \{1,\dots,p\}: \hat\beta_j \ne 0\}.
\end{align*}
[/definition]
The active-set definition records which coordinates use the sign part of the subgradient formula and which coordinates use the interval part. This is needed in the next theorem, where the Lasso optimality condition becomes a residual-correlation equality on $\hat S$ and a residual-correlation inequality on its complement.
[explanation: KKT Conditions for the Lasso]
Let $X\in\mathbb R^{n\times p}$, $Y\in\mathbb R^n$, and $\lambda\geq 0$. A vector $\hat\beta\in\mathbb R^p$ minimises
\begin{align*}
\frac{1}{2n}|Y-X\beta|^2+\lambda\|\beta\|_1
\end{align*}
if and only if there is a vector $\hat z\in\partial\|\hat\beta\|_1$ such that
\begin{align*}
\frac{1}{n}X^\top(Y-X\hat\beta)=\lambda \hat z.
\end{align*}
Equivalently, each active coordinate satisfies
\begin{align*}
\frac{1}{n}X_j^\top(Y-X\hat\beta)=\lambda\,\operatorname{sgn}(\hat\beta_j)
\end{align*}
when $\hat\beta_j\neq 0$, while each inactive coordinate satisfies
\begin{align*}
\left|\frac{1}{n}X_j^\top(Y-X\hat\beta)\right|\leq \lambda
\end{align*}
when $\hat\beta_j=0$.
[/explanation]
The KKT equations are the main bridge between optimisation and statistics in the Lasso. Each hypothesis has a specific role. Convexity of the objective is what turns the first-order subgradient condition into a global optimality certificate, rather than merely a local condition. The assumption $\lambda \ge 0$ is needed because the $\ell_1$ term must remain a convex penalty; a negative coefficient would reward large $\ell_1$ norm and can make the objective fail to have a minimiser. For a concrete one-dimensional failure, take $p=n=1$, $X=0$, $Y=0$, and $\lambda=-1$. Then the objective is $Q_{-1}(\beta)=-|\beta|$, whose infimum is $-\infty$ and whose KKT-style stationarity cannot certify any minimiser because no minimiser exists. The requirement $\hat z \in \partial\|\hat\beta\|_1$ is what supplies the sign equations on active coordinates and the interval inequalities on inactive coordinates. Without that subgradient membership, the vector equation would not encode sparsity at zero.
Statistically, the inactive-coordinate inequalities say that no excluded variable can have residual correlation exceeding $\lambda$, so the regularisation parameter acts as a correlation threshold. The statement is still only an optimality certificate; it does not by itself imply that the active set equals the true support, nor that the selected signs are correct. When columns are highly correlated, two different active sets can produce the same fitted value and both satisfy the same residual-correlation inequalities. Later sparsistency and risk results add probability and design assumptions precisely to turn this deterministic certificate into statistical information.
[example: Residual Correlations at a Solution]
Assume $X^\top X/n=I_p$, set $z=X^\top Y/n$, and take the Lasso solution from *Soft-Thresholding Formula for Orthonormal Design*:
\begin{align*}
\hat\beta_j=S_\lambda(z_j)=\operatorname{sgn}(z_j)(|z_j|-\lambda)_+.
\end{align*}
For each coordinate $j$, the residual correlation is
\begin{align*}
\frac{1}{n}X_j^\top(Y-X\hat\beta)=\frac{1}{n}X_j^\top Y-\frac{1}{n}X_j^\top X\hat\beta.
\end{align*}
Since $z_j=X_j^\top Y/n$ and $X^\top X/n=I_p$, the second term is the $j$th coordinate of $\hat\beta$, so
\begin{align*}
\frac{1}{n}X_j^\top(Y-X\hat\beta)=z_j-\hat\beta_j.
\end{align*}
If $|z_j|>\lambda$, then $(|z_j|-\lambda)_+=|z_j|-\lambda$ and $z_j\ne 0$. Hence
\begin{align*}
z_j-\hat\beta_j=z_j-\operatorname{sgn}(z_j)(|z_j|-\lambda).
\end{align*}
Expanding the last term gives
\begin{align*}
z_j-\hat\beta_j=z_j-\operatorname{sgn}(z_j)|z_j|+\lambda\operatorname{sgn}(z_j).
\end{align*}
Because $z_j=\operatorname{sgn}(z_j)|z_j|$ when $z_j\ne 0$, this becomes
\begin{align*}
z_j-\hat\beta_j=\lambda\operatorname{sgn}(z_j).
\end{align*}
In this regime $\hat\beta_j=\operatorname{sgn}(z_j)(|z_j|-\lambda)$ is nonzero and has the same sign as $z_j$, so
\begin{align*}
\frac{1}{n}X_j^\top(Y-X\hat\beta)=\lambda\operatorname{sgn}(\hat\beta_j).
\end{align*}
If $|z_j|\le \lambda$, then $(|z_j|-\lambda)_+=0$, so $\hat\beta_j=0$. Therefore
\begin{align*}
\frac{1}{n}X_j^\top(Y-X\hat\beta)=z_j.
\end{align*}
The assumption $|z_j|\le \lambda$ gives
\begin{align*}
\left|\frac{1}{n}X_j^\top(Y-X\hat\beta)\right|=|z_j|\le \lambda.
\end{align*}
Thus active coordinates have residual correlation exactly equal to $\lambda$ times their fitted sign, while inactive coordinates have residual correlation bounded in absolute value by $\lambda$, matching the two coordinate conditions in the Lasso KKT characterisation.
[/example]
## The Basic Inequality for Penalised Empirical Risk
The KKT conditions describe the endpoint of the optimisation problem. For risk bounds, the starting point is even more elementary: the value of the penalised objective at the estimator is no larger than its value at any comparator. This comparison is called the basic inequality, and later oracle inequalities are refinements of it.
[quotetheorem:5555]
[citeproof:5555]
The assumptions identify exactly where the displayed prediction-error inequality comes from. Optimality of $\hat\beta$ is necessary for the comparator inequality: with $p=n=1$, $X=1$, $Y=1$, and $\lambda=0$, the least-squares solution is $\hat\beta=1$. If one incorrectly used $\tilde\beta=0$ as though it were optimal and compared it with $\beta=1$, the asserted inequality would read $1/2\le 0$, which is false. The model decomposition $Y=X\beta^*+\varepsilon$ is also necessary for the specialised form: it is what turns the difference of squared residuals into the noise [inner product](/page/Inner%20Product) plus prediction error. In the same one-dimensional example, take the true Lasso solution $\hat\beta=1$ but try to use $\beta^*=0$ and $\varepsilon=0$ even though $Y\ne X\beta^*+\varepsilon$. The specialised inequality would assert $1/2\le 0$, so the model equation cannot be omitted or replaced by an unrelated choice of noise vector.
The next probabilistic step, used in the cone argument of Chapter 3 and the oracle bounds of Chapter 4, is to bound the noise inner product by duality:
\begin{align*}
\frac{1}{n}\varepsilon^\top X(\hat\beta-\beta^*)
\le \left\|\frac{1}{n}X^\top\varepsilon\right\|_\infty\|\hat\beta-\beta^*\|_1.
\end{align*}
This is why the score concentration result from Chapter 1 naturally dictates a choice $\lambda$ of order $\sigma\sqrt{\log p/n}$ under Gaussian or sub-Gaussian noise, where $\sigma>0$ denotes the noise standard deviation in the Gaussian case and the corresponding sub-Gaussian scale parameter in the sub-Gaussian case. The basic inequality alone gives no rate, no sparsity conclusion, and no prediction bound, because the right-hand side may still be large. It becomes useful only after the score term is controlled and the design matrix supplies a lower bound on $|X(\hat\beta-\beta^*)|^2/n$ over the cone of plausible errors. In that sense, this theorem is the deterministic entry point for oracle inequalities rather than an oracle inequality by itself.
[remark: Statistical Meaning of the Basic Inequality]
The basic inequality separates the Lasso analysis into two parts. Probability controls the empirical process term $\|X^\top\varepsilon/n\|_\infty$, while geometry controls how $\|\hat\beta\|_1$ compares with $\|\beta^*\|_1$ on and off the true support. Restricted eigenvalue assumptions enter only after this separation has been made.
[/remark]
## Soft-Thresholding in the Orthonormal Design Case
The orthonormal design case is the clean model in which the Lasso can be solved coordinate by coordinate. It shows both the attraction and the cost of $\ell_1$ regularisation: small empirical coefficients are set to zero, while large ones are pulled toward zero by exactly $\lambda$.
[definition: Soft-Thresholding Operator]
For $\lambda \ge 0$, the soft-thresholding operator $S_\lambda:\mathbb R \to \mathbb R$ is
\begin{align*}
S_\lambda:u \mapsto \operatorname{sgn}(u)(|u|-\lambda)_+.
\end{align*}
The coordinatewise extension, also denoted $S_\lambda:\mathbb R^p \to \mathbb R^p$, is
\begin{align*}
S_\lambda:z \mapsto (S_\lambda(z_1),\dots,S_\lambda(z_p)).
\end{align*}
[/definition]
The operator has two regimes: inside $[-\lambda,\lambda]$ it returns zero, while outside that interval it preserves the sign and subtracts $\lambda$ from the magnitude. For the Lasso, the obstruction to such a coordinatewise formula is correlation between design columns: changing one coefficient usually changes the residual correlations of the others. When the empirical Gram matrix is the identity, that coupling disappears, so each KKT equation becomes a separate one-dimensional thresholding problem.
[quotetheorem:5556]
[citeproof:5556]
The formula is a useful warning. Even when a variable is selected, its coefficient is not an unbiased least-squares coefficient; it has been deliberately shrunk toward zero. The orthonormal hypothesis is doing substantial work, because it removes all interactions between coordinates and makes each KKT equation scalar. A two-variable counterexample shows the failure. Suppose $p=2$, the empirical Gram entries satisfy $X_1^\top X_1/n=X_2^\top X_2/n=1$ and $X_1^\top X_2/n=\rho$ with $0<\rho<1$, take $z=X^\top Y/n=(a,a)$ with $a>\lambda$, and suppose both fitted coefficients are positive. The KKT equations become
\begin{align*}
\hat\beta_1+\rho\hat\beta_2=a-\lambda,
\qquad
\rho\hat\beta_1+\hat\beta_2=a-\lambda.
\end{align*}
By symmetry, $\hat\beta_1=\hat\beta_2=(a-\lambda)/(1+\rho)$, while coordinatewise soft-thresholding would give $a-\lambda$ in both coordinates. These are equal only when $\rho=0$. Once $X^\top X/n$ has off-diagonal entries, changing one coefficient changes the residual correlations of the others, and soft-thresholding no longer solves the full system. The formula therefore gives the correct intuition for thresholding and shrinkage, but the next section explains why correlated designs require genuinely multivariate geometry.
[example: Bias from Shrinkage in One Coordinate]
Assume $p=1$, $X^\top X/n=1$, $Y=X\beta^*+\varepsilon$, and $\lambda\ge 0$. Then
\begin{align*}
z
=\frac{1}{n}X^\top Y
=\frac{1}{n}X^\top(X\beta^*+\varepsilon)
=\left(\frac{1}{n}X^\top X\right)\beta^*+\frac{1}{n}X^\top\varepsilon
=\beta^*+\frac{1}{n}X^\top\varepsilon.
\end{align*}
By *Soft-Thresholding Formula for Orthonormal Design*, the one-coordinate Lasso estimator is
\begin{align*}
\hat\beta=S_\lambda(z)=\operatorname{sgn}(z)(|z|-\lambda)_+.
\end{align*}
On the event $z>\lambda$, we have $z>0$, so $\operatorname{sgn}(z)=1$ and $|z|=z$. Hence
\begin{align*}
\hat\beta
=\operatorname{sgn}(z)(|z|-\lambda)_+
=1\cdot(z-\lambda)
=z-\lambda,
\end{align*}
and therefore
\begin{align*}
\hat\beta-z=(z-\lambda)-z=-\lambda.
\end{align*}
Thus, in this selected positive regime, the Lasso coefficient is shifted downward by exactly $\lambda$ relative to the unpenalised score $z$.
On the event $|z|\le \lambda$, we have $|z|-\lambda\le 0$, so
\begin{align*}
(|z|-\lambda)_+=0
\end{align*}
and therefore
\begin{align*}
\hat\beta=\operatorname{sgn}(z)\cdot 0=0.
\end{align*}
This example isolates the two sources of bias: thresholding bias from setting small scores to zero, and shrinkage bias from subtracting $\lambda$ after a coefficient has been selected.
[/example]
## Geometry with Correlated Variables
The orthonormal case hides a major feature of high-dimensional regression: covariates compete when they are correlated. In a correlated design, a variable can enter or leave the active set because another variable explains nearly the same direction in the response.
[example: Two Correlated Variables]
Let $p=2$, assume $|X_1|^2/n=|X_2|^2/n=1$, and let $X_1^\top X_2/n=\rho$ with $0<\rho<1$. Write $z_j=X_j^\top Y/n$. For $\beta=(\beta_1,\beta_2)$, the fitted vector is $X\beta=\beta_1X_1+\beta_2X_2$, so
\begin{align*}
\frac{1}{2n}|Y-X\beta|^2=\frac{1}{2n}(Y-\beta_1X_1-\beta_2X_2)^\top(Y-\beta_1X_1-\beta_2X_2).
\end{align*}
Expanding the inner product gives
\begin{align*}
\frac{1}{2n}|Y-X\beta|^2=\frac{1}{2n}Y^\top Y-\frac{\beta_1}{n}X_1^\top Y-\frac{\beta_2}{n}X_2^\top Y+\frac{\beta_1^2}{2n}X_1^\top X_1+\frac{\beta_1\beta_2}{n}X_1^\top X_2+\frac{\beta_2^2}{2n}X_2^\top X_2.
\end{align*}
Using $z_j=X_j^\top Y/n$, $X_1^\top X_1/n=X_2^\top X_2/n=1$, and $X_1^\top X_2/n=\rho$, this becomes
\begin{align*}
\frac{1}{2n}|Y-X\beta|^2=\frac{1}{2n}|Y|^2-z_1\beta_1-z_2\beta_2+\frac12\beta_1^2+\rho\beta_1\beta_2+\frac12\beta_2^2.
\end{align*}
The term $|Y|^2/(2n)$ does not depend on $\beta$, so the Lasso objective has the same minimisers as
\begin{align*}
\frac12\beta_1^2+\rho\beta_1\beta_2+\frac12\beta_2^2-z_1\beta_1-z_2\beta_2+\lambda(|\beta_1|+|\beta_2|).
\end{align*}
Now restrict attention to a region where both fitted coefficients are positive. Then $|\beta_1|=\beta_1$ and $|\beta_2|=\beta_2$, so on this region the objective is
\begin{align*}
F(\beta_1,\beta_2)=\frac12\beta_1^2+\rho\beta_1\beta_2+\frac12\beta_2^2-z_1\beta_1-z_2\beta_2+\lambda\beta_1+\lambda\beta_2.
\end{align*}
Differentiating term by term with respect to $\beta_1$ gives
\begin{align*}
\frac{\partial F}{\partial \beta_1}=\beta_1+\rho\beta_2-z_1+\lambda.
\end{align*}
Differentiating term by term with respect to $\beta_2$ gives
\begin{align*}
\frac{\partial F}{\partial \beta_2}=\rho\beta_1+\beta_2-z_2+\lambda.
\end{align*}
At an interior minimiser with $\beta_1>0$ and $\beta_2>0$, both partial derivatives must equal zero. Therefore
\begin{align*}
\beta_1+\rho\beta_2-z_1+\lambda=0.
\end{align*}
Also,
\begin{align*}
\rho\beta_1+\beta_2-z_2+\lambda=0.
\end{align*}
Rearranging the first equation gives
\begin{align*}
\beta_1+\rho\beta_2=z_1-\lambda.
\end{align*}
Rearranging the second equation gives
\begin{align*}
\rho\beta_1+\beta_2=z_2-\lambda.
\end{align*}
Thus each coordinate's stationarity equation contains the other coefficient through the correlation term $\rho$. When $z_1$ and $z_2$ are close and $\rho$ is near $1$, increasing one coefficient changes the other coordinate's residual-correlation equation almost as strongly as increasing that coordinate itself, which is the algebraic source of competition between correlated variables.
[/example]
This competition is the geometric reason later chapters impose restricted eigenvalue or compatibility conditions. Such assumptions do not require every pair of columns to be weakly correlated, but they prevent sparse linear combinations from becoming too close to the kernel of $X$.
[remark: Corners of the Ell One Ball]
In two dimensions, the constrained Lasso minimises elliptical loss contours over a diamond-shaped $\ell_1$ ball. The diamond has corners on the coordinate axes, so the first point of contact often has one coordinate equal to zero. With correlated variables, the ellipses tilt, and the point of contact can switch between nearby corners as the tuning parameter changes.
[/remark]
The chapter has now supplied the optimisation language needed for the statistical theory. The KKT conditions give exact residual-correlation certificates, the basic inequality begins the finite-sample risk analysis, and the soft-thresholding formula shows how sparsity and shrinkage arise from the $\ell_1$ penalty. Chapter 3 takes the next step by combining these deterministic facts with the score concentration from Chapter 1 and geometric lower bounds on $X$ over sparse cones.
Once the Lasso is written in KKT form, the key remaining issue is geometric: the estimator must be controlled on a sparse cone rather than on all of ℝ^p. The next chapter turns that intuition into cone conditions and restricted eigenvalues, which quantify when the design matrix is informative enough for sparse recovery.
# 3. Cone Conditions and Restricted Eigenvalues
This chapter turns the KKT and basic inequality machinery from the Lasso into geometric control of the estimation error. The central question is not whether the empirical Gram matrix
\begin{align*}
\frac{1}{n}X^\top X
\end{align*}
is invertible on all of $\mathbb R^p$, since that is impossible when $p>n$, but whether it is well behaved on the restricted set of errors that the Lasso can produce. The chapter develops the cone condition, then compares the main restricted design assumptions used to convert prediction control into estimation control.
## The Lasso Cone from the Basic Inequality
The first problem is to understand why an $\ell_1$ penalty, which is imposed on the estimator rather than directly on the error, forces the error vector to point mostly along the true support. Let $Y=X\beta^*+\varepsilon$, let $\hat\beta$ be any Lasso solution,
\begin{align*}
\hat\beta \in \operatorname*{argmin}_{\beta\in\mathbb R^p}\left\{\frac{1}{2n}|Y-X\beta|^2+\lambda\|\beta\|_1\right\},
\end{align*}
and set $\Delta=\hat\beta-\beta^*$. If $S=\operatorname{supp}(\beta^*)$, then $\Delta_S$ records error on the true variables and $\Delta_{S^c}$ records false-variable error.
The basic inequality from Chapter 2 is the starting point because it uses only optimality of $\hat\beta$ and the linear model. Substituting $Y=X\beta^*+\varepsilon$ into the objective comparison at $\hat\beta$ and $\beta^*$ gives
\begin{align*}
\frac{1}{2n}|X\Delta|^2
\leq \frac{1}{n}\varepsilon^\top X\Delta+
\lambda(\|\beta^*\|_1-\|\beta^*+\Delta\|_1).
\end{align*}
This inequality is algebraic; probability enters through a high-probability event controlling the empirical score $X^\top\varepsilon/n$.
[definition: Score Event]
For $\lambda>0$, the Lasso score event is
\begin{align*}
\mathcal E_\lambda=\left\{\left\|\frac{1}{n}X^\top\varepsilon\right\|_\infty\leq \frac{\lambda}{2}\right\}.
\end{align*}
[/definition]
On $\mathcal E_\lambda$, Holder's inequality turns the stochastic term into an $\ell_1$ error term:
\begin{align*}
\frac{1}{n}\varepsilon^\top X\Delta
\leq \left\|\frac{1}{n}X^\top\varepsilon\right\|_\infty\|\Delta\|_1
\leq \frac{\lambda}{2}\|\Delta\|_1.
\end{align*}
The next result extracts the geometric consequence. It says that the Lasso error belongs to a cone whose aperture is controlled by the slack in the score bound.
[quotetheorem:5557]
[citeproof:5557]
The lemma is the main bridge from convex optimisation to sparse geometry. The score event is essential: if a noise column has $|X_j^\top\varepsilon/n|$ much larger than $\lambda$, then the stochastic linear term can reward movement in coordinate $j\notin S$ enough to overwhelm the penalty, and the cone conclusion can fail. The support assumption enters through $\beta^*_{S^c}=0$; without sparsity relative to $S$, the penalty difference would not separate true-coordinate error from false-coordinate error. The lemma does not say that $X$ identifies the support, that the Lasso solution is unique, or that $\Delta$ is small; it only restricts the possible directions of $\Delta$. The next step is therefore to ask whether the design has enough curvature on this cone to turn geometric restriction into numerical error bounds.
[example: Orthogonal Support Recovery Geometry]
Consider the idealised case $X^\top X/n=I_p$ with $p\leq n$, and set
\begin{align*}
z=\beta^*+\frac{1}{n}X^\top\varepsilon .
\end{align*}
Since $Y=X\beta^*+\varepsilon$, the Lasso objective can be rewritten as
\begin{align*}
\frac{1}{2n}|Y-X\beta|^2+\lambda\|\beta\|_1
=\frac{1}{2n}|X(\beta-\beta^*)-\varepsilon|^2+\lambda\|\beta\|_1 .
\end{align*}
Expanding the square gives
\begin{align*}
\frac{1}{2n}|X(\beta-\beta^*)-\varepsilon|^2
=\frac{1}{2n}|X(\beta-\beta^*)|^2-\frac{1}{n}\varepsilon^\top X(\beta-\beta^*)+\frac{1}{2n}|\varepsilon|^2 .
\end{align*}
Using $X^\top X/n=I_p$, the first two $\beta$-dependent terms become
\begin{align*}
\frac{1}{2n}|X(\beta-\beta^*)|^2-\frac{1}{n}\varepsilon^\top X(\beta-\beta^*)
=\frac{1}{2}|\beta-\beta^*|^2-\left(\frac{1}{n}X^\top\varepsilon\right)^\top(\beta-\beta^*) .
\end{align*}
With $z=\beta^*+X^\top\varepsilon/n$, completing the square gives
\begin{align*}
\frac{1}{2}|\beta-\beta^*|^2-\left(\frac{1}{n}X^\top\varepsilon\right)^\top(\beta-\beta^*)
=\frac{1}{2}|\beta-z|^2+\frac{1}{2}|\beta^*|^2-\frac{1}{2}|z|^2 .
\end{align*}
The terms $|\varepsilon|^2/(2n)$, $|\beta^*|^2/2$, and $-|z|^2/2$ do not depend on $\beta$, so minimising the Lasso objective is equivalent to minimising
\begin{align*}
\sum_{j=1}^p\left\{\frac{1}{2}(\beta_j-z_j)^2+\lambda|\beta_j|\right\}.
\end{align*}
Thus the optimisation separates by coordinate. For one coordinate, minimise $q_j(b)=\frac{1}{2}(b-z_j)^2+\lambda|b|$. If $b>0$, then $q_j'(b)=b-z_j+\lambda$, so a positive critical point must be $b=z_j-\lambda$ and is valid exactly when $z_j>\lambda$. If $b<0$, then $q_j'(b)=b-z_j-\lambda$, so a negative critical point must be $b=z_j+\lambda$ and is valid exactly when $z_j<-\lambda$. At $b=0$, the subgradient condition is
\begin{align*}
0\in -z_j+\lambda[-1,1],
\end{align*}
which is equivalent to $|z_j|\leq\lambda$. Therefore
\begin{align*}
\hat\beta_j=\operatorname{sgn}(z_j)(|z_j|-\lambda)_+ .
\end{align*}
For an inactive coordinate $j\in S^c$, $\beta^*_j=0$, hence
\begin{align*}
z_j=\frac{1}{n}X_j^\top\varepsilon .
\end{align*}
On the score event,
\begin{align*}
|z_j|=\left|\frac{1}{n}X_j^\top\varepsilon\right|\leq \frac{\lambda}{2}<\lambda .
\end{align*}
The soft-thresholding formula then gives
\begin{align*}
\hat\beta_j=\operatorname{sgn}(z_j)(|z_j|-\lambda)_+=0 .
\end{align*}
So, in the orthogonal design, the score event prevents every inactive coordinate from entering the model; a false discovery can occur only off this event, when an inactive noise correlation is large enough to cross the threshold $\lambda$.
[/example]
This example uses an invertible Gram matrix, but the proof of the cone lemma did not. The rest of the chapter asks which weaker design assumptions are sufficient once the error has been restricted to the cone.
## Restricted Eigenvalue and Compatibility Conditions
The next problem is to convert prediction size $|X\Delta|/\sqrt n$ into coefficient size. In low-dimensional regression this is done using the minimum eigenvalue of $X^\top X/n$, but in high dimension the relevant lower bound only needs to hold on cone vectors.
[definition: Lasso Cone]
Let $S\subset\{1,\dots,p\}$ and $c>0$. The Lasso cone with support $S$ and aperture $c$ is
\begin{align*}
\mathcal C(S,c)=\{\delta\in\mathbb R^p:\|\delta_{S^c}\|_1\leq c\|\delta_S\|_1\}.
\end{align*}
[/definition]
The cone contains sparse vectors supported on $S$, but it also contains vectors with many small coordinates outside $S$. To turn this geometric restriction into an error bound, we next impose a lower bound on the design that is tested only over this cone rather than over every direction in $\mathbb R^p$.
[definition: Restricted Eigenvalue Constant]
Let $X\in\mathbb R^{n\times p}$. For $S\subset\{1,\dots,p\}$ and $c>0$, the restricted eigenvalue constant is
\begin{align*}
\kappa_{\mathrm{RE}}(S,c)=\inf_{\delta\in\mathcal C(S,c)\setminus\{0\}}\frac{|X\delta|}{\sqrt n\,|\delta|}.
\end{align*}
[/definition]
A positive restricted eigenvalue prevents a cone vector from hiding from the design while keeping mass on $S$. The first oracle inequality below is for prediction rather than Euclidean estimation, so we now introduce the version of restricted curvature whose denominator is the $\ell_1$ size naturally produced by the basic inequality.
[definition: Compatibility Constant]
Let $X\in\mathbb R^{n\times p}$. For $S\subset\{1,\dots,p\}$ with $s=|S|$ and $c>0$, the compatibility constant is
\begin{align*}
\phi_{\mathrm{comp}}(S,c)=\inf_{\delta\in\mathcal C(S,c)\setminus\{0\}}\frac{\sqrt{s}\,|X\delta|}{\sqrt n\,\|\delta_S\|_1}.
\end{align*}
[/definition]
Compatibility is tailored to prediction bounds because the basic inequality controls $\|\Delta_S\|_1$ before it controls $|\Delta_S|$. After the cone lemma has confined the error to $\mathcal C(S,3)$, the remaining obstacle is that an $\ell_1$ error term still appears on the right-hand side of the basic inequality. Compatibility is the curvature condition that converts this unresolved active-set $\ell_1$ term into prediction loss.
[quotetheorem:5558]
[citeproof:5558]
The rate is called fast because, if $\varepsilon_i$ are Gaussian with variance $\sigma^2$ and the columns of $X$ are normalised, the usual choice $\lambda\asymp \sigma\sqrt{\log p/n}$ gives prediction error of order $\sigma^2 s\log p/n$. The score event is not cosmetic: if $X_j^\top\varepsilon/n$ is much larger than $\lambda$ for some inactive column $j$, the Lasso can reduce the objective by fitting noise, and the prediction bound need not hold. Compatibility is also necessary for this argument; if a non-zero cone vector satisfies $X\delta=0$ while $\delta_S\neq 0$, then prediction loss cannot control the $\ell_1$ term left by the basic inequality. The theorem is limited to prediction error and does not by itself give Euclidean coefficient recovery, uniqueness, or exact support recovery. For estimation loss we need the design to separate cone directions strongly enough in coefficient space.
[quotetheorem:5559]
[citeproof:5559]
The exact numerical constant is less important than the dependence $\lambda\sqrt{s}/\kappa^2$. The restricted eigenvalue hypothesis is essential: if there is a non-zero $\delta\in\mathcal C(S,3)$ with $X\delta=0$, then $\beta^*$ and $\beta^*+t\delta$ have the same fitted values for every $t$, so coefficient error cannot be controlled from prediction information. The theorem still does not guarantee exact support recovery, sign recovery, or uniqueness of the Lasso solution; duplicated or highly correlated active and inactive variables can preserve the same prediction while changing the selected support. If $\varepsilon_i$ are Gaussian with variance $\sigma^2$ and the columns of $X$ are normalised, the usual choice $\lambda\asymp \sigma\sqrt{\log p/n}$ gives the familiar estimation rate $\sigma\sqrt{s\log p/n}$ up to the restricted-eigenvalue factor.
## Sparse Eigenvalues and Restricted Strong Convexity
The cone is natural for the Lasso, but other methods and proof strategies use related lower-curvature assumptions. The problem in this section is to compare them without treating them as interchangeable.
[definition: Sparse Eigenvalue]
Let $X\in\mathbb R^{n\times p}$. For an integer $m\geq 1$, the lower $m$-sparse eigenvalue of $X$ is
\begin{align*}
\rho_-(m)=\inf\left\{\frac{|X\delta|^2}{n|\delta|^2}:\delta\in\mathbb R^p\setminus\{0\},\ |\operatorname{supp}(\delta)|\leq m\right\}.
\end{align*}
[/definition]
Sparse eigenvalues control exactly sparse directions. They are useful for methods whose iterates or errors are known to be sparse, and they can imply cone conditions after approximation arguments, but the implication usually requires $m$ larger than $s$ and constants depending on the cone aperture.
[definition: Restricted Strong Convexity]
Let $L_n:\mathbb R^p\to\mathbb R$ be a convex empirical loss and let $\mathcal A\subset\mathbb R^p$ be a set of directions. The loss satisfies restricted strong convexity on $\mathcal A$ with curvature $\alpha>0$ and tolerance $\tau\geq 0$ if for all $\beta\in\mathbb R^p$ and all $\delta\in\mathcal A$,
\begin{align*}
L_n(\beta+\delta)-L_n(\beta)-\nabla L_n(\beta)^\top\delta
\geq \frac{\alpha}{2}|\delta|^2-\tau\|\delta\|_1^2.
\end{align*}
[/definition]
For squared loss, the left-hand side equals $|X\delta|^2/(2n)$, so restricted strong convexity is a lower bound on the empirical quadratic form. The tolerance term is useful for random designs where curvature may fail on tiny directions but remains adequate after the regularisation argument restricts the error.
[quotetheorem:5560]
[citeproof:5560]
This theorem explains why the same design geometry appears under several names. The zero-tolerance hypothesis is doing real work: if $\tau>0$, then the lower bound contains the negative term $-\tau\|\delta\|_1^2$, which can dominate $|\delta|^2$ on wide cone directions unless sparsity or a sharper cone bound absorbs it. A concrete failure occurs when $S=\{1\}$, $X_1=X_2$, and $\delta=e_1-e_2$. Then $\delta\in\mathcal C(S,3)$ because $\|\delta_{S^c}\|_1=\|\delta_S\|_1=1$, but $X\delta=0$, so the squared loss is flat along a statistically admissible error direction. No Euclidean estimation theorem can follow from this argument in such a direction. The result is also limited to squared loss with zero tolerance; for logistic regression and other empirical losses, RSC is a local or approximate curvature statement rather than a literal eigenvalue condition. The same idea reappears in empirical risk minimisation beyond linear regression: the loss supplies curvature only on the directions allowed by the regulariser, as in matrix completion with nuclear-norm cones or sparse graphical-model estimation with decomposable penalties. The forward use is that the oracle inequalities in Chapter 4 replace exact Gram-matrix eigenvalues by whichever restricted curvature inequality matches the loss under study.
[example: Block-Correlated Design With Restricted Curvature]
Take a concrete four-column design whose empirical Gram matrix $G=X^\top X/n$ satisfies
\begin{align*}
G_{11}=G_{22}=G_{33}=G_{44}=1,\quad G_{12}=G_{21}=\rho,\quad G_{34}=G_{43}=1,
\end{align*}
with all other off-diagonal entries equal to $0$, where $0\leq \rho<1$. Let the true support be $S=\{1\}$. Since the third and fourth columns are duplicated in the Gram geometry,
\begin{align*}
G(e_3-e_4)=0,
\end{align*}
so $G$ is singular and the design is not globally invertible.
Now let $\delta=(\delta_1,\delta_2,\delta_3,\delta_4)\in\mathcal C(S,3)$. The cone condition says
\begin{align*}
|\delta_2|+|\delta_3|+|\delta_4|\leq 3|\delta_1|.
\end{align*}
In particular, if $\delta_1=0$, then $|\delta_2|+|\delta_3|+|\delta_4|\leq 0$, hence $\delta_2=\delta_3=\delta_4=0$. Thus every non-zero cone vector has $\delta_1\neq 0$.
For such a vector, the empirical quadratic form is
\begin{align*}
\frac{1}{n}|X\delta|^2=\delta^\top G\delta=\delta_1^2+2\rho\delta_1\delta_2+\delta_2^2+(\delta_3+\delta_4)^2.
\end{align*}
The first three terms can be rewritten as
\begin{align*}
\delta_1^2+2\rho\delta_1\delta_2+\delta_2^2=(1-\rho)(\delta_1^2+\delta_2^2)+\rho(\delta_1+\delta_2)^2.
\end{align*}
Since $\rho\geq 0$ and $(\delta_1+\delta_2)^2\geq 0$, this gives
\begin{align*}
\frac{1}{n}|X\delta|^2\geq (1-\rho)(\delta_1^2+\delta_2^2).
\end{align*}
The cone condition also controls the duplicated-block coordinates:
\begin{align*}
|\delta_3|+|\delta_4|\leq 3|\delta_1|-|\delta_2|\leq 3|\delta_1|.
\end{align*}
Therefore
\begin{align*}
\delta_3^2+\delta_4^2\leq (|\delta_3|+|\delta_4|)^2\leq 9\delta_1^2.
\end{align*}
Together with $|\delta_2|\leq 3|\delta_1|$, this yields
\begin{align*}
|\delta|^2=\delta_1^2+\delta_2^2+\delta_3^2+\delta_4^2\leq \delta_1^2+\delta_2^2+9\delta_1^2\leq 10(\delta_1^2+\delta_2^2).
\end{align*}
Combining the lower bound on $|X\delta|^2/n$ with this upper bound on $|\delta|^2$ gives
\begin{align*}
\frac{|X\delta|^2}{n|\delta|^2}\geq \frac{(1-\rho)(\delta_1^2+\delta_2^2)}{10(\delta_1^2+\delta_2^2)}=\frac{1-\rho}{10}.
\end{align*}
Hence $\kappa_{\mathrm{RE}}(S,3)\geq \sqrt{(1-\rho)/10}>0$, even though the full Gram matrix is singular. The duplicated unused block creates global non-identifiability, but the cone tied to $S=\{1\}$ prevents the error from living entirely in that null direction.
[/example]
The example separates global invertibility from restricted curvature. What matters for estimating a particular sparse vector is not whether every direction in $\mathbb R^p$ is identifiable, but whether the directions compatible with that vector's support are identifiable.
## Why Ordinary Eigenvalues Fail in High Dimension
The final problem is to understand why high-dimensional regression cannot rely on the minimum eigenvalue of $X^\top X/n$. This is not a technical inconvenience; it is a linear-algebraic obstruction.
[quotetheorem:5561]
[citeproof:5561]
This theorem is the reason restricted assumptions are not optional in the $p>n$ regime. Any theorem demanding a positive global minimum eigenvalue would exclude the regime the course is designed to study. The conclusion does not say that estimation is hopeless: the null vector supplied by rank-nullity may be dense and may lie far outside the Lasso cone associated with the true support. For example, a design can be globally singular because it has many irrelevant redundant columns, while still having positive restricted curvature on $\mathcal C(S,3)$. The theorem therefore marks the boundary between ordinary low-dimensional linear algebra and the restricted geometry needed for sparse regression.
[example: Duplicated Columns and Failed Support Recovery]
Let $X_1=X_2$ and let $\beta^*=ae_1$ with $a\neq 0$. For each $\theta\in[0,1]$, define
\begin{align*}
\beta^{(\theta)}=a\theta e_1+a(1-\theta)e_2 .
\end{align*}
Then $Xe_1=X_1$ and $Xe_2=X_2$, so
\begin{align*}
X\beta^{(\theta)}=a\theta X_1+a(1-\theta)X_2 .
\end{align*}
Using $X_2=X_1$ gives
\begin{align*}
X\beta^{(\theta)}=a\theta X_1+a(1-\theta)X_1=aX_1=X(ae_1)=X\beta^* .
\end{align*}
Hence every vector $\beta^{(\theta)}$ has the same prediction vector as $\beta^*$, and therefore
\begin{align*}
|X\beta^{(\theta)}-X\beta^*|=0 .
\end{align*}
The $\ell_1$ penalty also does not distinguish these representations. Since $\theta\in[0,1]$, the coefficients $a\theta$ and $a(1-\theta)$ have the same sign as $a$, so
\begin{align*}
\|\beta^{(\theta)}\|_1=|a\theta|+|a(1-\theta)| .
\end{align*}
Because $\theta\geq 0$ and $1-\theta\geq 0$, this becomes
\begin{align*}
\|\beta^{(\theta)}\|_1=|a|\theta+|a|(1-\theta)=|a| .
\end{align*}
Thus $\|ae_1\|_1=\|ae_2\|_1=\|\beta^{(\theta)}\|_1=|a|$.
For any response vector $Y$, the fitted-value term is the same for all these choices because $X\beta^{(\theta)}=X\beta^*$, and the penalty term is the same because $\|\beta^{(\theta)}\|_1=|a|$. Therefore
\begin{align*}
\frac{1}{2n}|Y-X\beta^{(\theta)}|^2+\lambda\|\beta^{(\theta)}\|_1=\frac{1}{2n}|Y-X\beta^*|^2+\lambda|a| .
\end{align*}
The choices $\theta=1$ and $\theta=0$ have supports $\{1\}$ and $\{2\}$, while every $0<\theta<1$ has support $\{1,2\}$. Thus duplicated columns can leave prediction and the $\ell_1$ objective unchanged while changing the selected support, so exact support recovery requires assumptions that separate active variables from their inactive duplicates.
[/example]
This example shows that restricted eigenvalue conditions are estimation and prediction assumptions, not full model-selection assumptions. To recover the exact support, later chapters introduce stronger conditions such as irrepresentability or beta-min assumptions.
[remark: Hierarchy of Design Assumptions]
Global positive definiteness implies all restricted lower-curvature conditions, but it is unavailable when $p>n$. Restricted eigenvalue conditions are strong enough for Euclidean estimation; compatibility conditions are often enough for fast prediction; sparse eigenvalues control exactly sparse directions; restricted strong convexity generalises these ideas beyond squared loss. The correct assumption is therefore chosen according to the target guarantee, not according to a single universal ordering.
[/remark]
The chapter's conclusion is that the Lasso succeeds through a two-step mechanism. The penalty first forces the error into a cone, and the design only needs enough curvature on that cone. This mechanism is the template for the Lasso oracle inequalities proved next.
The cone geometry now provides the deterministic mechanism needed for statistical guarantees: the error is forced into a region where the design has curvature, and the score term is already controlled by concentration. The next chapter turns this mechanism into oracle inequalities, comparing the Lasso to an ideal sparse benchmark in prediction and estimation risk.
# 4. Oracle Inequalities for Lasso
Oracle inequalities turn the deterministic basic inequality from Chapter 2 and the cone geometry from Chapter 3 into statistical risk bounds. The guiding question is not whether the Lasso recovers the exact support, but how well its fitted values and coefficients compare with an ideal sparse benchmark. This chapter separates the slow-rate result, which needs no geometry beyond column normalisation, from fast-rate results, where restricted eigenvalue or compatibility assumptions convert prediction control into sharper sparse rates.
## Slow-Rate Prediction Without Restricted Eigenvalues
The first question is what can be proved before imposing any restricted eigenvalue condition on the design. In high dimensions the Gram matrix $X^\top X/n$ may be singular, so estimation of $\beta^*$ in Euclidean norm is impossible without extra structure. Prediction is more forgiving: the basic inequality already controls $X(\hat\beta-\beta^*)$, provided the noise correlations are dominated by the Lasso penalty.
Throughout this section let $Y=X\beta^*+\varepsilon$, where $X\in\mathbb R^{n\times p}$ is fixed, $\beta^*\in\mathbb R^p$, and $\varepsilon\in\mathbb R^n$ is random noise. The Lasso estimator is any solution
\begin{align*}
\hat\beta\in\operatorname*{argmin}_{\beta\in\mathbb R^p}\left\{\frac{1}{2n}|Y-X\beta|^2+\lambda\|\beta\|_1\right\}.
\end{align*}
The notation $|\cdot|$ denotes the Euclidean norm on $\mathbb R^n$, while $\|\cdot\|_1$ denotes the $\ell_1$ norm on $\mathbb R^p$.
The basic inequality leaves a stochastic inner product between the noise and the fitted error. We isolate the event on which that term is controlled coordinatewise.
[definition: Score Event]
Let $X\in\mathbb R^{n\times p}$, $\varepsilon\in\mathbb R^n$, and $\lambda>0$. The score event at level $\lambda$ is
\begin{align*}
\mathcal E_\lambda=\left\{\left\|\frac{X^\top\varepsilon}{n}\right\|_\infty\le \frac{\lambda}{2}\right\}.
\end{align*}
[/definition]
The score event is useful because it makes the random part of the basic inequality no larger than the deterministic penalty. This is enough to ask for a prediction bound that works for any design matrix, even when sparse directions are not identifiable.
[quotetheorem:5562]
[citeproof:5562]
This theorem gives a valid finite-sample oracle inequality even when the columns of $X$ are nearly collinear. The score event is essential for this deterministic statement: if a column $X_j$ has large correlation with the noise, the Lasso can reduce the residual by putting mass on that coordinate, so no bound of this form can hold uniformly at the same $\lambda$. The theorem also does not estimate $\beta^*$ in Euclidean norm; for instance, if two columns of $X$ are identical, changing the coefficients on those two columns in opposite directions leaves the fitted values unchanged. Its weakness is visible in the right-hand side: two vectors with the same support size can have very different $\ell_1$ norms, so the next step is to understand the tuning scale through a concrete noise calculation.
[example: Gaussian Choice of Lambda for the Slow Rate]
Assume $|X_j|^2/n\le 1$ for every column $X_j$ and $\varepsilon\sim\mathcal N(0,\sigma^2I_n)$. For each $j$,
\begin{align*}
\frac{X_j^\top\varepsilon}{n}=\frac{1}{n}\sum_{i=1}^n X_{ij}\varepsilon_i.
\end{align*}
This is a centered Gaussian random variable because it is a linear combination of the jointly Gaussian coordinates of $\varepsilon$. Since the noise coordinates are independent and each has variance $\sigma^2$,
\begin{align*}
\operatorname{Var}\left(\frac{X_j^\top\varepsilon}{n}\right)=\frac{1}{n^2}\sum_{i=1}^n X_{ij}^2\operatorname{Var}(\varepsilon_i).
\end{align*}
Substituting $\operatorname{Var}(\varepsilon_i)=\sigma^2$ gives
\begin{align*}
\operatorname{Var}\left(\frac{X_j^\top\varepsilon}{n}\right)=\frac{\sigma^2}{n^2}|X_j|^2.
\end{align*}
The column normalisation $|X_j|^2/n\le 1$ is equivalent to $|X_j|^2\le n$, so
\begin{align*}
\operatorname{Var}\left(\frac{X_j^\top\varepsilon}{n}\right)\le \frac{\sigma^2}{n}.
\end{align*}
Let
\begin{align*}
t=\sigma\sqrt{\frac{2\log(2p/\delta)}{n}}.
\end{align*}
If $Z\sim\mathcal N(0,\tau^2)$ with $\tau^2\le \sigma^2/n$, the standard Gaussian tail bound gives
\begin{align*}
\mathbb P(|Z|>t)\le 2\exp\left(-\frac{t^2}{2\tau^2}\right).
\end{align*}
Since $\tau^2\le \sigma^2/n$, we have $1/\tau^2\ge n/\sigma^2$, hence
\begin{align*}
2\exp\left(-\frac{t^2}{2\tau^2}\right)\le 2\exp\left(-\frac{nt^2}{2\sigma^2}\right).
\end{align*}
By the definition of $t$,
\begin{align*}
\frac{nt^2}{2\sigma^2}=\log(2p/\delta).
\end{align*}
Therefore
\begin{align*}
\mathbb P(|Z|>t)\le 2\exp(-\log(2p/\delta))=\frac{\delta}{p}.
\end{align*}
Now choose
\begin{align*}
\lambda=2\sigma\sqrt{\frac{2\log(2p/\delta)}{n}}.
\end{align*}
Then $\lambda/2=t$. Using the union bound over the $p$ coordinates,
\begin{align*}
\mathbb P(\mathcal E_\lambda^c)\le \sum_{j=1}^p \mathbb P\left(\left|\frac{X_j^\top\varepsilon}{n}\right|>\frac{\lambda}{2}\right).
\end{align*}
Each summand is at most $\delta/p$, so
\begin{align*}
\mathbb P(\mathcal E_\lambda^c)\le \sum_{j=1}^p \frac{\delta}{p}=\delta.
\end{align*}
Thus $\mathbb P(\mathcal E_\lambda)\ge 1-\delta$.
On $\mathcal E_\lambda$, the *[Slow-Rate Lasso Prediction Inequality](/theorems/5562)* gives
\begin{align*}
\frac{1}{n}|X(\hat\beta-\beta^*)|^2\le 4\lambda\|\beta^*\|_1.
\end{align*}
Substituting the chosen value of $\lambda$ gives
\begin{align*}
4\lambda\|\beta^*\|_1=4\left(2\sigma\sqrt{\frac{2\log(2p/\delta)}{n}}\right)\|\beta^*\|_1.
\end{align*}
Multiplying the constants,
\begin{align*}
4\left(2\sigma\sqrt{\frac{2\log(2p/\delta)}{n}}\right)\|\beta^*\|_1=8\sigma\|\beta^*\|_1\sqrt{\frac{2\log(2p/\delta)}{n}}.
\end{align*}
Hence, with probability at least $1-\delta$,
\begin{align*}
\frac{1}{n}|X(\hat\beta-\beta^*)|^2\le 8\sigma\|\beta^*\|_1\sqrt{\frac{2\log(2p/\delta)}{n}}.
\end{align*}
This is the slow-rate prediction scale $\sigma\|\beta^*\|_1\sqrt{\log p/n}$, with the confidence level entering through $\log(2p/\delta)$.
[/example]
The example explains the common scaling $\lambda\asymp \sigma\sqrt{\log p/n}$. The factor $\sqrt{\log p}$ is the cost of controlling $p$ empirical correlations at once; it is the first place where high dimensionality enters the oracle theory.
## Fast-Rate Bounds from Sparse Geometry
The next question is what extra condition lets the Lasso achieve a rate depending on the number of active variables rather than on $\|\beta^*\|_1$. The basic inequality gives a cone constraint: the error vector $\Delta=\hat\beta-\beta^*$ has most of its $\ell_1$ mass on the true support. A restricted eigenvalue or compatibility condition says that vectors in this cone cannot have small prediction norm unless their active coordinates are small.
Let $S=\operatorname{supp}(\beta^*)$ and $s=|S|$. For $\Delta\in\mathbb R^p$, write $\Delta_S$ for the vector agreeing with $\Delta$ on $S$ and vanishing on $S^c$.
The slow-rate bound did not identify the shape of the error vector. To get a sparsity-dependent rate, we first need to show that the inactive coordinates of the error are controlled by the active coordinates.
[quotetheorem:5557]
[citeproof:5557]
The cone condition says that the error is effectively sparse even if $\hat\beta$ itself is not exactly sparse. The exact support assumption on $\beta^*$ matters here: if $\beta^*$ has many small nonzero coordinates outside $S$, the decomposability step produces an additional approximation term rather than a pure cone. The result also does not say that $\hat\beta_{S^c}=0$; it only bounds the total off-support $\ell_1$ error by the on-support error. To turn this into a rate, we need a lower bound on $|X\Delta|/\sqrt n$ for vectors inside this cone, since identical or nearly identical columns can still make a cone vector almost invisible in prediction.
[definition: Compatibility Constant]
Let $X\in\mathbb R^{n\times p}$, let $S\subset\{1,\dots,p\}$ with $|S|=s$, and let $c_0>0$. The compatibility constant $\phi_0(S,c_0)$ is the largest number $\phi\ge 0$ such that every $\Delta\in\mathbb R^p$ satisfying $\|\Delta_{S^c}\|_1\le c_0\|\Delta_S\|_1$ also satisfies
\begin{align*}
\frac{|X\Delta|^2}{n}\ge \frac{\phi^2}{s}\|\Delta_S\|_1^2.
\end{align*}
[/definition]
Compatibility is tailored to prediction bounds because it relates the prediction norm directly to $\|\Delta_S\|_1$. For coefficient estimation we also need a version that controls Euclidean size on the active coordinates, which motivates the restricted eigenvalue constant.
[definition: Restricted Eigenvalue Constant]
Let $X\in\mathbb R^{n\times p}$, let $S\subset\{1,\dots,p\}$, and let $c_0>0$. The restricted eigenvalue constant $\kappa(S,c_0)$ is the largest number $\kappa\ge 0$ such that every $\Delta\in\mathbb R^p$ satisfying $\|\Delta_{S^c}\|_1\le c_0\|\Delta_S\|_1$ also satisfies
\begin{align*}
\frac{|X\Delta|^2}{n}\ge \kappa^2|\Delta_S|^2.
\end{align*}
[/definition]
By Cauchy--Schwarz, $\|\Delta_S\|_1^2\le s|\Delta_S|^2$, so a positive restricted eigenvalue implies a positive compatibility constant, with comparable constants. Since the basic inequality controls $\|\Delta_S\|_1$ through the penalty, compatibility is the condition that can now be inserted directly into the Lasso argument to produce a fast prediction rate.
[quotetheorem:5563]
[citeproof:5563]
With $\lambda\asymp\sigma\sqrt{\log p/n}$, this theorem gives prediction error of order $\sigma^2s\log p/n$ when the compatibility constant is bounded away from zero. The compatibility hypothesis is doing real work: if a nonzero cone vector lies in the null space of $X$, then the same argument would try to lower-bound a zero prediction norm by a positive active-coordinate norm, which is impossible. The theorem still concerns fitted values, not coefficient recovery; different coefficient vectors can give the same $X\beta$ whenever the design is not identifiable in the relevant directions. Prediction control is often the main statistical target, but if the goal is to estimate the coefficient vector itself, we need to combine the cone condition with stronger restricted eigenvalue control.
[quotetheorem:5564]
[citeproof:5564]
The $\ell_2$ parameter rate is $s\lambda^2$, while the $\ell_1$ parameter rate is $s\lambda$. The stronger full-vector restricted eigenvalue hypothesis is necessary for the displayed Euclidean conclusion: the active-coordinate restricted eigenvalue only controls $|\Delta_S|$, and a cone vector may have many small off-support entries whose Euclidean contribution is not bounded by that condition alone. Thus this theorem is a coefficient-estimation result, not merely a prediction result, and it excludes designs with nearly invisible cone directions in the full parameter norm. These rates become numerically meaningful once $\lambda$ is replaced by its Gaussian concentration scale, which also gives a practical sample-size calculation.
[example: Sample Size for a Target Prediction Error]
Suppose $\phi_0(S,3)\ge \phi>0$, $\varepsilon\sim\mathcal N(0,\sigma^2I_n)$, and $|X_j|^2/n\le 1$ for every column $X_j$. From the Gaussian score calculation, the choice
\begin{align*}
\lambda=2\sigma\sqrt{\frac{2\log(2p/\delta)}{n}}
\end{align*}
ensures $\mathbb P(\mathcal E_\lambda)\ge 1-\delta$.
On $\mathcal E_\lambda$, the *Fast-Rate Lasso Prediction Under Compatibility* theorem gives
\begin{align*}
\frac{1}{n}|X(\hat\beta-\beta^*)|^2\le \frac{9\lambda^2s}{\phi_0(S,3)^2}.
\end{align*}
Since $\phi_0(S,3)\ge \phi>0$, we have $\phi_0(S,3)^2\ge \phi^2$, so
\begin{align*}
\frac{9\lambda^2s}{\phi_0(S,3)^2}\le \frac{9\lambda^2s}{\phi^2}.
\end{align*}
Now square the chosen value of $\lambda$:
\begin{align*}
\lambda^2=\left(2\sigma\sqrt{\frac{2\log(2p/\delta)}{n}}\right)^2.
\end{align*}
Using $(ab)^2=a^2b^2$ and $(\sqrt u)^2=u$ for $u\ge 0$,
\begin{align*}
\lambda^2=(2\sigma)^2\frac{2\log(2p/\delta)}{n}.
\end{align*}
Since $(2\sigma)^2=4\sigma^2$,
\begin{align*}
\lambda^2=\frac{8\sigma^2\log(2p/\delta)}{n}.
\end{align*}
Substituting this into the fast-rate bound gives
\begin{align*}
\frac{9\lambda^2s}{\phi^2}=\frac{9s}{\phi^2}\cdot \frac{8\sigma^2\log(2p/\delta)}{n}.
\end{align*}
Multiplying the constants $9\cdot 8=72$ yields
\begin{align*}
\frac{9\lambda^2s}{\phi^2}=\frac{72\sigma^2s\log(2p/\delta)}{\phi^2 n}.
\end{align*}
Combining the preceding inequalities, with probability at least $1-\delta$,
\begin{align*}
\frac{1}{n}|X(\hat\beta-\beta^*)|^2\le \frac{72\sigma^2s\log(2p/\delta)}{\phi^2 n}.
\end{align*}
To guarantee that this upper bound is at most a target level $\eta>0$, it is enough to require
\begin{align*}
\frac{72\sigma^2s\log(2p/\delta)}{\phi^2 n}\le \eta.
\end{align*}
Because $n>0$, multiplying both sides by $n$ gives
\begin{align*}
\frac{72\sigma^2s\log(2p/\delta)}{\phi^2}\le \eta n.
\end{align*}
Because $\eta>0$, dividing by $\eta$ gives
\begin{align*}
n\ge \frac{72\sigma^2s\log(2p/\delta)}{\phi^2\eta}.
\end{align*}
This sufficient sample-size condition displays the roles of sparsity $s$, noise level $\sigma^2$, ambient dimension $p$, compatibility strength $\phi$, target error $\eta$, and confidence level $1-\delta$.
[/example]
The sample-size formula should be read as a sufficient condition, not as an exact threshold. It is useful because it makes the effective dimension $s$ visible: the Lasso pays for the number of relevant variables and a logarithmic search over the irrelevant ones.
## Oracle Interpretation
The final question is why these inequalities are called oracle inequalities. An oracle that knew the true support $S$ in advance would fit a low-dimensional regression on $S$ and pay a variance cost of order $\sigma^2s/n$. The Lasso does not know $S$, so it pays the inflated cost $\sigma^2s\log p/n$, reflecting the need to search across many possible supports.
The oracle viewpoint is more flexible than exact sparsity. Instead of comparing only with the true vector $\beta^*$, we can compare with any sparse linear predictor $X\beta$, allowing an approximation term for bias and a complexity term for sparsity.
[definition: Sparse Oracle Tradeoff]
For $m\in\{1,\dots,p\}$ and $\beta\in\mathbb R^p$, let $\|\beta\|_0=|\{j:\beta_j\ne0\}|$. A sparse oracle tradeoff has the form
\begin{align*}
\frac{1}{n}|X(\hat\beta-\beta^*)|^2\le C\inf_{\beta\in\mathbb R^p}\left\{\frac{1}{n}|X(\beta-\beta^*)|^2+\lambda^2\|\beta\|_0\right\},
\end{align*}
with constants depending on the relevant sparse geometry condition.
[/definition]
The first term is approximation error: how well the chosen sparse vector predicts the truth. The second is estimation error: the price of estimating the selected coefficients after searching over $p$ variables. The natural theorem asks whether the Lasso satisfies this tradeoff uniformly over sparse comparison vectors, rather than only at the true vector.
[quotetheorem:5565]
[citeproof:5565]
This theorem says that the Lasso adapts to the best sparse predictor available in the model class. The uniform compatibility assumption is needed because the comparator support is not fixed in advance; if some support of size at most $m$ contains a nearly null sparse direction, the complexity term cannot control the estimation part on that support. The theorem also does not identify which sparse model is best, nor does it guarantee exact support recovery; it only compares prediction risk with the oracle tradeoff. If $\beta^*$ is exactly sparse, choosing $\beta=\beta^*$ recovers the fast-rate theorem. If $\beta^*$ is only approximately sparse, the infimum balances how many coefficients are retained against the prediction bias from discarding the rest.
[example: Approximate Sparsity and the Bias-Variance Balance]
Assume $|\beta^*|_{(j)}\le Aj^{-a}$ with $A>0$ and $a>1/2$, and let $\beta_k$ keep the $k$ largest coordinates of $\beta^*$ and set all other coordinates to zero, where $1\le k\le m$. Since $\|\beta_k\|_0\le k$, the oracle inequality compares the Lasso with
\begin{align*}
\frac{1}{n}|X(\beta_k-\beta^*)|^2+\frac{\lambda^2 k}{\phi^2}.
\end{align*}
Suppose also that $|Xv|^2/n\le L^2|v|^2$ for every $v\in\mathbb R^p$. Applying this with $v=\beta_k-\beta^*$ gives
\begin{align*}
\frac{1}{n}|X(\beta_k-\beta^*)|^2\le L^2|\beta_k-\beta^*|^2.
\end{align*}
Because $\beta_k$ agrees with $\beta^*$ on the $k$ largest coordinates and is zero on the remaining coordinates,
\begin{align*}
|\beta_k-\beta^*|^2=\sum_{j>k}|\beta^*|_{(j)}^2.
\end{align*}
Using $|\beta^*|_{(j)}\le Aj^{-a}$ for each $j$,
\begin{align*}
\sum_{j>k}|\beta^*|_{(j)}^2\le \sum_{j>k}A^2j^{-2a}=A^2\sum_{j=k+1}^{\infty}j^{-2a}.
\end{align*}
Since $2a>1$, the decreasing integral comparison gives
\begin{align*}
\sum_{j=k+1}^{\infty}j^{-2a}\le \int_k^\infty x^{-2a}\,dx.
\end{align*}
Evaluating the integral,
\begin{align*}
\int_k^\infty x^{-2a}\,dx=\left[\frac{x^{1-2a}}{1-2a}\right]_{x=k}^{x=\infty}.
\end{align*}
Because $1-2a<0$, we have $\lim_{x\to\infty}x^{1-2a}=0$, so
\begin{align*}
\left[\frac{x^{1-2a}}{1-2a}\right]_{x=k}^{x=\infty}=0-\frac{k^{1-2a}}{1-2a}=\frac{k^{1-2a}}{2a-1}.
\end{align*}
Combining these bounds,
\begin{align*}
\frac{1}{n}|X(\beta_k-\beta^*)|^2\le \frac{A^2L^2}{2a-1}k^{1-2a}.
\end{align*}
Thus one may take
\begin{align*}
C_{A,a,L}=\frac{A^2L^2}{2a-1}.
\end{align*}
For this comparator, the oracle tradeoff is bounded by
\begin{align*}
C_{A,a,L}k^{1-2a}+\frac{\lambda^2 k}{\phi^2}.
\end{align*}
The first term decreases with $k$ because $1-2a<0$, while the second increases linearly in $k$. The oracle choice of $k$ balances the tail bias from discarding small coefficients against the estimation cost of keeping more coordinates, and the Lasso inherits that tradeoff up to universal constants and the geometry factor.
[/example]
The oracle language also clarifies the role of tuning. A larger $\lambda$ improves noise domination but increases the complexity term; a smaller $\lambda$ reduces shrinkage but may fail to control the empirical score. The usual choice $\lambda\asymp\sigma\sqrt{\log p/n}$ is the point at which the high-probability score event and the oracle complexity term are matched.
[remark: Slow and Fast Rates Compared]
The slow-rate inequality requires no restricted eigenvalue assumption and gives a bound of order $\sigma\|\beta^*\|_1\sqrt{\log p/n}$. The fast-rate inequality requires sparse geometry and gives a prediction bound of order $\sigma^2s\log p/n$. Neither result dominates the other for every parameter vector and design: the slow rate is robust to degeneracy, while the fast rate captures the statistical advantage of sparse well-conditioned directions.
[/remark]
The chapter's main message is that Lasso oracle inequalities have two ingredients. Concentration of the score fixes the tuning scale, and sparse geometry converts the deterministic basic inequality into fast prediction and estimation rates. Later chapters refine both ingredients: sharper design assumptions lead to support recovery and debiasing, while alternative penalties modify the bias and geometry of the estimator.
Oracle inequalities show that the Lasso can estimate well even when exact recovery is out of reach, but they also make clear that prediction error and variable selection are different goals. The next chapter narrows the focus to support recovery and sign consistency, where stronger separation conditions are needed to identify the nonzero coordinates.
# 5. Variable Selection and Sign Recovery
This chapter shifts the target from predicting $X\beta^*$ well to identifying which coordinates of $\beta^*$ are nonzero. Chapters 3 and 4 showed that the Lasso can achieve sharp prediction and estimation rates under restricted eigenvalue assumptions, even when exact model recovery is too much to ask. Here the question is more delicate: when does regularisation recover the active set and the signs of the coefficients, and when do correlations in the design make that target unattainable?
The main theme is that variable selection is a stronger demand than prediction. A small coefficient can be harmless for prediction yet impossible to distinguish from zero, while a highly correlated inactive variable can imitate the effect of an active one. The irrepresentable condition and the beta-min condition isolate these two obstructions.
## Model Selection and Prediction Ask Different Questions
Prediction theory measures the fitted vector $X\hat{\beta}$, so it can tolerate many coefficient vectors that produce similar fitted values. Variable selection instead asks for the exact coordinates of the signal, so it depends on identifiability at the level of columns of $X$ and on the size of the smallest nonzero coefficient.
[definition: Support and Sign Vector]
Let $\beta^* \in \mathbb R^p$. Its support is
\begin{align*}
S := \{j \in \{1,\dots,p\}: \beta^*_j \ne 0\}.
\end{align*}
The sign vector is $\operatorname{sign}(\beta^*) \in \{-1,0,1\}^p$, with $\operatorname{sign}(\beta^*)_j = 1$ if $\beta^*_j > 0$, $-1$ if $\beta^*_j < 0$, and $0$ if $\beta^*_j=0$.
[/definition]
The support records which variables matter in the sparse model, while the sign vector also records the direction of each nonzero effect. To compare procedures across sample sizes, we need an asymptotic success criterion for this entire signed pattern, not just a deterministic description of the target.
[definition: Sign Consistency]
Consider a sequence of linear models indexed by $n$, with design $X_n\in\mathbb R^{n\times p_n}$, target vector $\beta^*=\beta^*_n\in\mathbb R^{p_n}$, support $S_n=\operatorname{supp}(\beta^*_n)$, and estimator $\hat{\beta}_n\in\mathbb R^{p_n}$. The probability is taken over the noise, and also over $X_n$ when the design is random. The estimator sequence is sign consistent for $\beta^*$ if
\begin{align*}
\mathbb P\left(\operatorname{sign}(\hat{\beta}_n) = \operatorname{sign}(\beta^*_n)\right) \to 1.
\end{align*}
[/definition]
Sign consistency is a probability statement over the noise, and in random-design models it is also over the design. Some applications only care whether the selected set of covariates is right, and not whether the sign of every selected coefficient is part of the reported target. This motivates model selection consistency as the support-only version of exact recovery.
[definition: Model Selection Consistency]
In the same sequence of models, an estimator sequence $\hat{\beta}_n$ is model selection consistent if
\begin{align*}
\mathbb P\left(\{j: \hat{\beta}_{n,j} \ne 0\}=S_n\right)\to 1.
\end{align*}
[/definition]
Model selection consistency ignores signs once the selected set is correct. The oracle inequalities from earlier chapters had a different goal: they controlled fitted values rather than the names of selected variables. This motivates prediction consistency as the baseline notion against which exact support recovery should be compared.
[definition: Prediction Consistency]
In the same sequence of models, an estimator sequence $\hat{\beta}_n$ is prediction consistent if
\begin{align*}
\frac{1}{n}|X_n(\hat{\beta}_n-\beta^*_n)|^2 \xrightarrow{\mathbb P} 0.
\end{align*}
[/definition]
Prediction consistency can hold in regimes where model selection consistency fails. The reason is that prediction sees the geometry of the column span, whereas support recovery sees the labelling of individual columns.
[example: Prediction Without Exact Recovery]
Suppose $X_1,X_2\in\mathbb R^n$ are nearly identical columns, and suppose the true vector satisfies $\beta^*_1=a$ and $\beta^*_2=0$. Compare $\beta^*$ with an estimator $\hat{\beta}$ that moves this coefficient from the first column to the second:
\begin{align*}
\hat{\beta}_1=0,\qquad \hat{\beta}_2=a,\qquad \hat{\beta}_j=\beta^*_j \text{ for } j\notin\{1,2\}.
\end{align*}
Then $\operatorname{supp}(\beta^*)$ contains $1$ and not $2$, while $\operatorname{supp}(\hat{\beta})$ contains $2$ and not $1$, so exact support recovery fails.
The fitted-vector error is obtained coordinate by coordinate:
\begin{align*}
X(\hat{\beta}-\beta^*)=X_1(\hat{\beta}_1-\beta^*_1)+X_2(\hat{\beta}_2-\beta^*_2)+\sum_{j\notin\{1,2\}}X_j(\hat{\beta}_j-\beta^*_j).
\end{align*}
Substituting the displayed values gives
\begin{align*}
X(\hat{\beta}-\beta^*)=X_1(0-a)+X_2(a-0)+\sum_{j\notin\{1,2\}}X_j\cdot 0.
\end{align*}
Hence
\begin{align*}
X(\hat{\beta}-\beta^*)=-aX_1+aX_2=a(X_2-X_1).
\end{align*}
Therefore the prediction loss is
\begin{align*}
\frac{1}{n}|X(\hat{\beta}-\beta^*)|^2=\frac{1}{n}|a(X_2-X_1)|^2=\frac{a^2}{n}|X_2-X_1|^2.
\end{align*}
Thus the selected variable can be wrong even when prediction loss is small, because prediction only sees the small column difference $X_2-X_1$.
[/example]
This example explains why the assumptions for support recovery must be stronger than the assumptions for oracle inequalities. Restricted eigenvalue conditions prevent sparse directions from vanishing, but exact selection also requires inactive variables not to be too well represented by the active variables.
## The Irrepresentable Condition and the Beta-Min Barrier
The Lasso chooses variables by balancing residual correlations against the threshold $\lambda$. To select the correct variables, the inactive columns must not be able to reproduce the signed contribution of the active columns after the shrinkage bias has been introduced.
Throughout this section write
\begin{align*}
\hat{\Sigma} := \frac{1}{n}X^\top X, \qquad S := \operatorname{supp}(\beta^*), \qquad s := |S|.
\end{align*}
Assume $\hat{\Sigma}_{SS}$ is invertible. The Lasso estimator is any minimiser of
\begin{align*}
\hat{\beta}\in \operatorname*{argmin}_{\beta\in\mathbb R^p}\left\{\frac{1}{2n}|Y-X\beta|^2+\lambda\|\beta\|_1\right\}.
\end{align*}
[definition: Irrepresentable Condition]
The design satisfies the irrepresentable condition on $S$ with margin $\eta\in(0,1]$ if
\begin{align*}
\left\|\hat{\Sigma}_{S^cS}\hat{\Sigma}_{SS}^{-1}\operatorname{sign}(\beta^*_S)\right\|_\infty \le 1-\eta.
\end{align*}
[/definition]
The matrix product in this condition is the vector of inactive correlations induced by the signed active set after projecting through the active Gram matrix. This controls false positives, but it says nothing about whether the true nonzero coefficients are large enough to survive the Lasso's shrinkage. This motivates the beta-min condition as the active-coordinate separation needed to avoid false negatives.
[definition: Beta-Min Condition]
The vector $\beta^*$ satisfies a beta-min condition at level $b_n>0$ if
\begin{align*}
\min_{j\in S}|\beta^*_j| \ge b_n.
\end{align*}
[/definition]
The beta-min condition separates true but small coefficients from zero. Exact sign recovery has two distinct failure modes: an inactive variable may satisfy the dual KKT inequalities too tightly and enter the model, or a true coefficient may be shrunk through zero. The active-set construction below addresses both barriers by pairing irrepresentability for inactive coordinates with beta-min separation for active coordinates.
[quotetheorem:5566]
[citeproof:5566]
This theorem turns exact support recovery into two checkable inequalities: active coefficients must survive shrinkage and noise, and inactive dual variables must stay below the Lasso threshold. The invertibility of $\hat{\Sigma}_{SS}$ is needed because the restricted problem on the active variables must have a well-defined local linear solution; without it, different active coefficient vectors can produce the same fitted values, so the active signs are not identifiable even before considering inactive variables. The irrepresentable condition is the mechanism preventing false positives: if it fails badly, an inactive column can inherit almost the same signed residual correlation as the active set and enter the KKT system. The beta-min inequality is the complementary mechanism preventing false negatives: if it fails, a true coefficient can be shifted through zero by the shrinkage term and the coordinatewise noise. The theorem does not say that prediction accuracy implies support recovery, nor does it give recovery when the irrepresentable condition fails; it is a conditional witness construction showing that these two deterministic and stochastic barriers are enough for sign recovery on the event displayed above.
[example: Orthogonal Design Exact Recovery]
Assume $p\le n$ and $X^\top X/n=I_p$. Write $Z_j:=X_j^\top Y/n$. Since $Y=X\beta^*+\varepsilon$, the $j$th coordinate satisfies
\begin{align*}
Z_j=\frac{X_j^\top X\beta^*}{n}+\frac{X_j^\top\varepsilon}{n}.
\end{align*}
Expanding $X\beta^*=\sum_{k=1}^p X_k\beta^*_k$ gives
\begin{align*}
\frac{X_j^\top X\beta^*}{n}=\sum_{k=1}^p \frac{X_j^\top X_k}{n}\beta^*_k.
\end{align*}
Because $X^\top X/n=I_p$, we have $X_j^\top X_k/n=1$ when $j=k$ and $0$ when $j\ne k$, so
\begin{align*}
Z_j=\beta^*_j+\frac{X_j^\top\varepsilon}{n}.
\end{align*}
The Lasso objective separates by coordinates. Expanding the squared norm,
\begin{align*}
\frac{1}{2n}|Y-X\beta|^2=\frac{1}{2n}Y^\top Y-\beta^\top\frac{X^\top Y}{n}+\frac{1}{2}\beta^\top\left(\frac{X^\top X}{n}\right)\beta.
\end{align*}
Using $X^\top X/n=I_p$ and $Z=X^\top Y/n$, the terms depending on $\beta$ are
\begin{align*}
\sum_{j=1}^p\left(\frac{1}{2}\beta_j^2-Z_j\beta_j+\lambda|\beta_j|\right).
\end{align*}
Thus each coordinate minimises $q_j(b):=\frac{1}{2}b^2-Z_jb+\lambda|b|$. If $b>0$, then $q_j'(b)=b-Z_j+\lambda$, so the positive critical point is $b=Z_j-\lambda$, valid exactly when $Z_j>\lambda$. If $b<0$, then $q_j'(b)=b-Z_j-\lambda$, so the negative critical point is $b=Z_j+\lambda$, valid exactly when $Z_j<-\lambda$. At $b=0$, the subgradient condition is $0\in[-Z_j-\lambda,-Z_j+\lambda]$, equivalently $|Z_j|\le\lambda$. Therefore
\begin{align*}
\hat{\beta}_j=\operatorname{sign}(Z_j)(|Z_j|-\lambda)_+.
\end{align*}
The irrepresentable condition holds with margin $\eta=1$. Indeed, for $j\in S^c$ and $k\in S$, orthogonality gives $X_j^\top X_k/n=0$, so
\begin{align*}
\hat{\Sigma}_{S^cS}=0.
\end{align*}
Hence
\begin{align*}
\left\|\hat{\Sigma}_{S^cS}\hat{\Sigma}_{SS}^{-1}\operatorname{sign}(\beta^*_S)\right\|_\infty=0.
\end{align*}
Now suppose
\begin{align*}
\max_{1\le j\le p}\left|\frac{X_j^\top\varepsilon}{n}\right|<\lambda.
\end{align*}
Also suppose
\begin{align*}
\min_{j\in S}|\beta^*_j|>\lambda+\max_{j\in S}\left|\frac{X_j^\top\varepsilon}{n}\right|.
\end{align*}
If $j\notin S$, then $\beta^*_j=0$, so $Z_j=X_j^\top\varepsilon/n$ and therefore $|Z_j|<\lambda$. Soft-thresholding gives $\hat{\beta}_j=0$. If $j\in S$, then the triangle inequality gives
\begin{align*}
|Z_j|\ge |\beta^*_j|-\left|\frac{X_j^\top\varepsilon}{n}\right|.
\end{align*}
The beta-min display makes the right-hand side larger than $\lambda$, so $\hat{\beta}_j\ne0$. The same inequality also gives $|X_j^\top\varepsilon/n|<|\beta^*_j|$, so the perturbation cannot move $\beta^*_j$ across zero; hence $Z_j$ and $\beta^*_j$ have the same sign. Since soft-thresholding preserves the sign of every nonzero $Z_j$, we get $\operatorname{sign}(\hat{\beta})=\operatorname{sign}(\beta^*)$ on these two events.
Finally, under $\varepsilon\sim\mathcal N(0,\sigma^2I_n)$, each $X_j^\top\varepsilon/n$ is Gaussian with mean $0$. Its variance is
\begin{align*}
\operatorname{Var}\left(\frac{X_j^\top\varepsilon}{n}\right)=\frac{\sigma^2}{n^2}|X_j|^2.
\end{align*}
Since $|X_j|^2/n=1$, this variance equals $\sigma^2/n$. For $t>0$, the Gaussian tail bound gives
\begin{align*}
\mathbb P\left(\left|\frac{X_j^\top\varepsilon}{n}\right|\ge t\right)\le 2\exp\left(-\frac{nt^2}{2\sigma^2}\right).
\end{align*}
A union bound over $j=1,\dots,p$ gives
\begin{align*}
\mathbb P\left(\max_{1\le j\le p}\left|\frac{X_j^\top\varepsilon}{n}\right|\ge t\right)\le 2p\exp\left(-\frac{nt^2}{2\sigma^2}\right).
\end{align*}
Thus a threshold $\lambda$ of order $\sigma\sqrt{\log p/n}$ dominates all coordinate noise terms with high probability, and exact sign recovery additionally requires every active coefficient to lie above that threshold plus its own noise fluctuation.
[/example]
The orthogonal design shows the cleanest version of the theory: false positives are prevented by thresholding pure noise coordinates, while false negatives are prevented by a signal gap above the same stochastic scale. Correlation couples these two tasks and introduces the irrepresentable condition.
[example: Failure Under a Correlated Inactive Variable]
Let $S=\{1\}$, $\beta^*_1>0$, and suppose the inactive column $X_2$ satisfies $X_2^\top X_1/n=\rho$ with $\rho$ positive and close to $1$. Since the columns are normalised,
\begin{align*}
\hat{\Sigma}_{SS}=\left(\frac{X_1^\top X_1}{n}\right)=(1).
\end{align*}
Thus
\begin{align*}
\hat{\Sigma}_{SS}^{-1}=(1)
\end{align*}
and, because $\beta^*_1>0$,
\begin{align*}
\operatorname{sign}(\beta^*_S)=1.
\end{align*}
For the inactive coordinate $2$, the deterministic irrepresentable term is
\begin{align*}
\hat{\Sigma}_{2S}\hat{\Sigma}_{SS}^{-1}\operatorname{sign}(\beta^*_S)=\left(\frac{X_2^\top X_1}{n}\right)(1)(1)=\rho.
\end{align*}
Therefore the irrepresentable condition would require $|\rho|\le 1-\eta$. When $\rho$ is close to $1$, the inactive coordinate is already within the small gap $1-\rho$ of the KKT boundary.
The corresponding residual-noise term for coordinate $2$ is
\begin{align*}
R_2=\frac{X_2^\top}{n}\left(I-X_1\left(\frac{X_1^\top X_1}{n}\right)^{-1}\frac{X_1^\top}{n}\right)\varepsilon.
\end{align*}
Multiplying out the two terms gives
\begin{align*}
R_2=\frac{X_2^\top\varepsilon}{n}-\frac{X_2^\top X_1}{n}\left(\frac{X_1^\top X_1}{n}\right)^{-1}\frac{X_1^\top\varepsilon}{n}.
\end{align*}
Using $X_2^\top X_1/n=\rho$ and $X_1^\top X_1/n=1$, this becomes
\begin{align*}
R_2=\frac{X_2^\top\varepsilon}{n}-\rho\frac{X_1^\top\varepsilon}{n}.
\end{align*}
Hence the inactive KKT coordinate for variable $2$ is
\begin{align*}
\rho+\frac{1}{\lambda}\left(\frac{X_2^\top\varepsilon}{n}-\rho\frac{X_1^\top\varepsilon}{n}\right).
\end{align*}
If the noise fluctuation satisfies
\begin{align*}
\frac{X_2^\top\varepsilon}{n}-\rho\frac{X_1^\top\varepsilon}{n}>\lambda(1-\rho),
\end{align*}
then
\begin{align*}
\rho+\frac{1}{\lambda}\left(\frac{X_2^\top\varepsilon}{n}-\rho\frac{X_1^\top\varepsilon}{n}\right)>\rho+(1-\rho)=1.
\end{align*}
On this event the inactive KKT inequality cannot hold with $\hat{\beta}_2=0$. Thus a highly correlated inactive column needs only a fluctuation larger than the small gap $\lambda(1-\rho)$ to cross the Lasso threshold, so exact support recovery is unstable when $\rho$ is close to $1$.
[/example]
This failure is not a defect of the proof technique. If an inactive covariate is too aligned with the signed active effect, the penalised problem cannot reliably distinguish the two roles from finite data at the Lasso threshold.
## Thresholding the Lasso
The Lasso often includes extra variables even when its prediction error is small. A natural second step is to discard coefficients whose absolute values are below a deterministic threshold, separating estimation from selection.
[definition: Thresholded Lasso Support]
Given a Lasso estimator $\hat{\beta}$ and a threshold $\tau>0$, the thresholded support is
\begin{align*}
\hat{S}_\tau := \{j\in\{1,\dots,p\}: |\hat{\beta}_j|>\tau\}.
\end{align*}
[/definition]
Thresholding is useful when the Lasso estimation error is controlled in the $\ell_\infty$ norm. The selection problem then reduces to a gap question: inactive coordinates must remain below the threshold despite estimation error, while active coordinates must remain above it after the same possible error. The deterministic separation calculation below records exactly how wide that gap must be.
[quotetheorem:5567]
[citeproof:5567]
This theorem is deliberately abstract: the work is in proving the $\ell_\infty$ error bound under a design condition. The event $\|\hat{\beta}-\beta^*\|_\infty\le r_n$ is indispensable for this deterministic argument, because without it an inactive coordinate can lie above the threshold or an active coordinate can be estimated too close to zero. The threshold range is also sharp in its role: $\tau\ge r_n$ is what removes all inactive coordinates on the event, while $\tau<\min_{j\in S}|\beta^*_j|-r_n$ is what retains every active coordinate. For example, if $\beta^*_j=3r_n/2$ on an active coordinate and the estimation error equals $-r_n$ with the same sign convention, then $|\hat{\beta}_j|=r_n/2$, so no threshold large enough to remove all inactive coordinates with errors of size $r_n$ can also retain this active coordinate. The theorem does not guarantee that the Lasso itself has small $\ell_\infty$ error, nor does it repair a design in which two columns are statistically indistinguishable. Its role is to separate the selection step from the estimation step: later results can focus on proving coordinatewise error bounds, and this deterministic lemma then converts those bounds into exact support recovery.
[example: Thresholding Removes Small False Positives]
Suppose $\|\hat{\beta}-\beta^*\|_\infty\le C\lambda$ and
\begin{align*}
\min_{j\in S}|\beta^*_j|>3C\lambda.
\end{align*}
Take the threshold $\tau=2C\lambda$. If $j\notin S$, then $\beta^*_j=0$, so
\begin{align*}
|\hat{\beta}_j|=|\hat{\beta}_j-\beta^*_j|.
\end{align*}
By the definition of the supremum norm,
\begin{align*}
|\hat{\beta}_j-\beta^*_j|\le \|\hat{\beta}-\beta^*\|_\infty\le C\lambda.
\end{align*}
Since $C\lambda<2C\lambda=\tau$, this gives
\begin{align*}
|\hat{\beta}_j|<\tau.
\end{align*}
Thus no inactive coordinate belongs to $\hat{S}_\tau=\{j:|\hat{\beta}_j|>\tau\}$.
If $j\in S$, then the [reverse triangle inequality](/theorems/2300) gives
\begin{align*}
|\hat{\beta}_j|\ge |\beta^*_j|-|\hat{\beta}_j-\beta^*_j|.
\end{align*}
Using the supremum-norm error bound,
\begin{align*}
|\hat{\beta}_j-\beta^*_j|\le \|\hat{\beta}-\beta^*\|_\infty\le C\lambda.
\end{align*}
Therefore
\begin{align*}
|\hat{\beta}_j|>3C\lambda-C\lambda=2C\lambda=\tau.
\end{align*}
Thus every active coordinate belongs to $\hat{S}_\tau$. Thresholding at $\tau=2C\lambda$ removes all inactive coordinates and keeps all active coordinates, so $\hat{S}_\tau=S$ on this supremum-norm error event.
[/example]
Thresholding changes the final selected set but does not repair a fundamentally non-identifiable design. If an inactive variable and an active variable are nearly exchangeable, no coordinatewise cutoff can decide which name should be kept without enough information in the data.
## False Positives, False Negatives, and Exact Recovery Limits
Support recovery errors have two different statistical meanings. A false positive means the procedure reports a variable outside the true support, while a false negative means it misses a variable whose coefficient is nonzero. The Lasso tuning parameter moves these errors in opposite directions.
[definition: False Positives and False Negatives]
For an estimated support $\hat{S}\subset\{1,\dots,p\}$ and true support $S$, the false positive set and false negative set are
\begin{align*}
\operatorname{FP} := \hat{S}\setminus S, \qquad \operatorname{FN} := S\setminus \hat{S}.
\end{align*}
[/definition]
A larger $\lambda$ suppresses false positives by demanding stronger residual correlations before a variable enters the model. The same larger $\lambda$ increases shrinkage bias, so weak true variables are more likely to be removed.
[remark: Tuning Trade-Off]
Prediction-optimal tuning and selection-optimal tuning need not coincide. Prediction may prefer a smaller penalty that admits some harmless extra variables, while exact support recovery needs a penalty large enough to dominate the maximum inactive noise correlation. Selection also requires the nonzero coefficients to sit above the resulting bias and noise level.
[/remark]
The tuning trade-off is the practical face of two structural barriers. The next statement records these barriers directly from the KKT equations, separating the active-coordinate obstruction from the inactive-coordinate obstruction.
[quotetheorem:5568]
[citeproof:5568]
This theorem gives a necessary KKT diagnosis rather than a sufficient recovery theorem. The active condition explains the beta-min barrier: if an active coefficient is of the same size as the coordinatewise noise $W_S$ and the shrinkage vector $\lambda\hat{\Sigma}_{SS}^{-1}s$, then a sign flip or thresholding to zero can occur even when no inactive variable is selected. In the orthogonal one-variable case, $\hat{\beta}_1=\operatorname{sign}(\beta^*_1+X_1^\top\varepsilon/n)(|\beta^*_1+X_1^\top\varepsilon/n|-\lambda)_+$, so a coefficient below the threshold is missed whenever the noise does not push it far enough away from zero. The inactive condition explains the irrepresentable barrier: if $\|\hat{\Sigma}_{S^cS}\hat{\Sigma}_{SS}^{-1}s\|_\infty>1$, then even the noiseless dual calculation lies outside the KKT box, and small stochastic fluctuations cannot restore recovery with high probability. The theorem does not claim that these necessary conditions alone guarantee recovery; the primal-dual witness theorem supplied the matching sufficient direction under strict margins and beta-min separation. The following examples isolate the two failure modes in the orthogonal design, where the algebra reduces to soft-thresholding.
[example: False Negative from a Small Signal]
In the orthogonal design $X^\top X/n=I_p$, the Lasso coordinate formula is
\begin{align*}
\hat{\beta}_j=\operatorname{sign}(Z_j)(|Z_j|-\lambda)_+,\qquad Z_j=\frac{X_j^\top Y}{n}.
\end{align*}
Let $\beta^*_1=\lambda/2$ and write $\xi_1:=X_1^\top\varepsilon/n$. Since $Y=X\beta^*+\varepsilon$ and $X^\top X/n=I_p$,
\begin{align*}
Z_1=\frac{X_1^\top X\beta^*}{n}+\frac{X_1^\top\varepsilon}{n}.
\end{align*}
Orthogonality gives $X_1^\top X_k/n=0$ for $k\ne1$ and $X_1^\top X_1/n=1$, so
\begin{align*}
\frac{X_1^\top X\beta^*}{n}=\sum_{k=1}^p \frac{X_1^\top X_k}{n}\beta^*_k=\beta^*_1.
\end{align*}
Therefore
\begin{align*}
Z_1=\beta^*_1+\xi_1=\frac{\lambda}{2}+\xi_1.
\end{align*}
On the event $|\xi_1|\le \lambda/2$, we have $-\lambda/2\le \xi_1\le \lambda/2$, hence
\begin{align*}
0\le \frac{\lambda}{2}+\xi_1\le \lambda.
\end{align*}
Thus $|Z_1|\le\lambda$, and the soft-thresholding formula gives
\begin{align*}
\hat{\beta}_1=\operatorname{sign}(Z_1)(|Z_1|-\lambda)_+=0.
\end{align*}
But $\beta^*_1=\lambda/2\ne0$, so coordinate $1$ is in the true support and is missed by the Lasso.
If all other coordinates are estimated correctly and only coordinate $1$ is dropped, then
\begin{align*}
X(\hat{\beta}-\beta^*)=X_1\left(0-\frac{\lambda}{2}\right)=-\frac{\lambda}{2}X_1.
\end{align*}
Using $|X_1|^2/n=1$ from the orthogonal normalisation,
\begin{align*}
\frac{1}{n}|X(\hat{\beta}-\beta^*)|^2=\frac{1}{n}\left|-\frac{\lambda}{2}X_1\right|^2=\frac{\lambda^2}{4n}|X_1|^2=\frac{\lambda^2}{4}.
\end{align*}
Thus an active coefficient below the Lasso threshold can be removed even though its entire squared prediction contribution is only $\lambda^2/4$.
[/example]
A false negative can therefore be caused by a weak active coefficient even in the absence of design correlation. The complementary error comes from an inactive coordinate whose noise correlation crosses the selection threshold.
[example: False Positive from an Inactive Noise Spike]
Assume $X^\top X/n=I_p$, and write $\xi_j:=X_j^\top\varepsilon/n$. For an inactive coordinate $j\notin S$, we have $\beta^*_j=0$, so
\begin{align*}
Z_j:=\frac{X_j^\top Y}{n}
=\frac{X_j^\top X\beta^*}{n}+\frac{X_j^\top\varepsilon}{n}
=\sum_{k=1}^p \frac{X_j^\top X_k}{n}\beta^*_k+\xi_j
=0+\xi_j
=\xi_j.
\end{align*}
In the orthogonal design the coordinatewise Lasso solution is
\begin{align*}
\hat{\beta}_j=\operatorname{sign}(Z_j)(|Z_j|-\lambda)_+,
\end{align*}
so for $j\notin S$,
\begin{align*}
\hat{\beta}_j\ne0
\quad\Longleftrightarrow\quad
(|\xi_j|-\lambda)_+>0
\quad\Longleftrightarrow\quad
|\xi_j|>\lambda.
\end{align*}
Thus a false positive occurs whenever at least one inactive noise correlation exceeds the threshold:
\begin{align*}
\max_{j\in S^c}|\xi_j|>\lambda.
\end{align*}
Under $\varepsilon\sim\mathcal N(0,\sigma^2I_n)$, each $\xi_j$ is Gaussian with mean $0$ and variance
\begin{align*}
\operatorname{Var}(\xi_j)
=\operatorname{Var}\left(\frac{X_j^\top\varepsilon}{n}\right)
=\frac{\sigma^2}{n^2}|X_j|^2
=\frac{\sigma^2}{n},
\end{align*}
because $|X_j|^2/n=1$. If $m:=|S^c|$ and the inactive columns are orthogonal, then the variables $\{\xi_j:j\in S^c\}$ are independent, since for $j\ne k$,
\begin{align*}
\operatorname{Cov}(\xi_j,\xi_k)
=\frac{\sigma^2}{n^2}X_j^\top X_k
=0.
\end{align*}
Therefore
\begin{align*}
\mathbb P\left(\max_{j\in S^c}|\xi_j|\le \lambda\right)
=\prod_{j\in S^c}\mathbb P(|\xi_j|\le\lambda)
=\left(1-\mathbb P(|\xi_1|>\lambda)\right)^m.
\end{align*}
Writing $G\sim\mathcal N(0,1)$ gives
\begin{align*}
\mathbb P(|\xi_1|>\lambda)
=\mathbb P\left(|G|>\frac{\sqrt n\,\lambda}{\sigma}\right).
\end{align*}
For example, if $\lambda=c\sigma\sqrt{\log m/n}$ with $0<c<\sqrt2$, then
\begin{align*}
\mathbb P(|\xi_1|>\lambda)
=\mathbb P(|G|>c\sqrt{\log m}).
\end{align*}
The standard Gaussian lower tail bound
\begin{align*}
\mathbb P(G>x)\ge \frac{1}{\sqrt{2\pi}}\frac{x}{1+x^2}e^{-x^2/2},\qquad x>0,
\end{align*}
gives
\begin{align*}
m\,\mathbb P(|\xi_1|>\lambda)
\ge
\frac{2m}{\sqrt{2\pi}}
\frac{c\sqrt{\log m}}{1+c^2\log m}
e^{-c^2\log m/2}
=
\frac{2c}{\sqrt{2\pi}}
\frac{\sqrt{\log m}}{1+c^2\log m}
m^{1-c^2/2}.
\end{align*}
Since $1-c^2/2>0$, the right-hand side tends to infinity. Using $(1-u)^m\le e^{-mu}$ with $u=\mathbb P(|\xi_1|>\lambda)$, we get
\begin{align*}
\mathbb P\left(\max_{j\in S^c}|\xi_j|\le \lambda\right)
\le
\exp\left(-m\,\mathbb P(|\xi_1|>\lambda)\right)
\to 0.
\end{align*}
Hence
\begin{align*}
\mathbb P\left(\max_{j\in S^c}\left|\frac{X_j^\top\varepsilon}{n}\right|>\lambda\right)\to1.
\end{align*}
So when the threshold is below the scale $\sigma\sqrt{\log |S^c|/n}$, and hence below the usual $\sigma\sqrt{\log p/n}$ scale up to constants, inactive coordinates cross the Lasso threshold with substantial probability; these crossings are false positives.
[/example]
The chapter's conclusion is that support recovery is possible, but only in a narrower regime than prediction consistency. The design must prevent inactive variables from representing the signed active set, the tuning parameter must dominate the maximum inactive noise fluctuation, and the nonzero coefficients must be large enough to survive the resulting threshold. This mirrors the separation-versus-noise logic in multiple testing and sparse signal detection: exact identification requires not only small aggregate error, but enough coordinatewise separation to name the nonzero variables.
Support recovery reveals the limitations of pure ℓ1 regularisation: coefficients must be large enough to survive the thresholding effect, and strong correlations can make selection unstable. The next chapter therefore broadens the toolkit to penalties such as the elastic net and adaptive Lasso, which trade off sparsity, grouping, and bias in different ways.
# 6. Elastic Net, Adaptive Lasso, and Structured Penalties
After Chapters 2 through 5 developed the Lasso's optimisation, risk, and selection theory, this chapter studies what happens once those results have identified both the power and the limitations of pure $\ell_1$ regularisation. The Lasso promotes sparsity, but it can behave unstably under severe collinearity, it shrinks large coefficients by the same absolute amount as small ones, and it treats every coordinate as an unrelated object. Elastic net, adaptive Lasso, and structured penalties are three responses to these issues: they keep the computational advantages of convex regularisation while altering the geometry of the penalty to better match the design.
The guiding questions are: how can a penalty stabilise selection when predictors are strongly correlated, how can different coordinates be penalised at different strengths, and how can known group structure be used without abandoning sparsity? We keep the linear model notation from the previous chapters: $Y = X\beta^* + \varepsilon$, where $X \in \mathbb R^{n \times p}$ is fixed unless stated otherwise, $Y \in \mathbb R^n$, and the columns $X_j$ are normalised so that $|X_j|^2/n = 1$.
## Elastic Net and Collinearity
The Lasso penalty has corners that induce sparsity, but it does not strongly regularise movement along nearly collinear directions. If two columns of $X$ are almost the same, then replacing coefficient mass on one column by coefficient mass on the other barely changes the fitted values. This creates instability in the active set even when prediction is stable.
The elastic net answers this by adding a quadratic penalty to the Lasso objective. The quadratic part discourages large coefficient norms along flat directions of the empirical loss, while the $\ell_1$ part keeps coordinate sparsity.
[definition: Elastic Net Estimator]
Let $\lambda_1 \ge 0$ and $\lambda_2 \ge 0$. The elastic net estimator is any solution
\begin{align*}
\hat{\beta}^{\mathrm{EN}} \in \operatorname*{arg\,min}_{\beta \in \mathbb R^p}
\left\{\frac{1}{2n}|Y - X\beta|^2 + \lambda_1\|\beta\|_1 + \frac{\lambda_2}{2}\,|\beta|^2\right\}.
\end{align*}
[/definition]
When $\lambda_2 = 0$ this is the Lasso. When $\lambda_1 = 0$ this is ridge regression. The mixed penalty therefore interpolates between sparse coordinate selection and quadratic stabilisation.
[remark: Scaling of the Quadratic Term]
Some texts write the elastic net penalty as $\lambda\{\alpha\|\beta\|_1 + (1-\alpha)|\beta|^2/2\}$ with $\alpha \in [0,1]$. This is the same family after setting $\lambda_1 = \lambda\alpha$ and $\lambda_2 = \lambda(1-\alpha)$.
[/remark]
This parametrisation is useful for tuning, but it does not yet say how to recognise a fitted elastic net solution. The next step is to derive the optimality equations, because those equations will explain both sparsity and the stabilising role of the ridge term.
[explanation: Elastic Net KKT Conditions]
Let $\lambda_1,\lambda_2\ge 0$, and define
\begin{align*}
F(\beta)=\frac{1}{2n}|Y-X\beta|^2+\lambda_1\|\beta\|_1+\frac{\lambda_2}{2}|\beta|^2.
\end{align*}
A vector $\hat{\beta}\in\mathbb R^p$ minimizes $F$ if and only if there is a vector $z\in\mathbb R^p$ such that
\begin{align*}
\frac{1}{n}X^\top(Y-X\hat{\beta})=\lambda_1 z+\lambda_2\hat{\beta}.
\end{align*}
The coordinates of $z$ satisfy
\begin{align*}
z_j=\operatorname{sgn}(\hat{\beta}_j)
\end{align*}
when $\hat{\beta}_j\ne 0$, and
\begin{align*}
z_j\in[-1,1]
\end{align*}
when $\hat{\beta}_j=0$. This is the usual Lasso subgradient condition with the additional ridge-gradient term $\lambda_2\hat{\beta}$.
[/explanation]
These equations say that inactive variables are still controlled by residual correlations, but active variables must balance residual correlation against both an $\ell_1$ sign term and a ridge shrinkage term. The hypotheses $\lambda_1\ge 0$ and $\lambda_2\ge 0$ are what make this a convex optimality statement rather than merely a formal stationarity equation. If $\lambda_1<0$, even the case $X=0$ rewards making $\|\beta\|_1$ large, so a minimiser need not exist. If $\lambda_2<0$, the quadratic penalty bends the objective downwards; for instance with $X=0$ the objective is unbounded below along any nonzero direction. Thus the KKT equations are useful because they are necessary and sufficient for a convex problem, and the ridge term then has a genuine stabilising interpretation along nearly flat directions of the loss.
[example: Two Correlated Predictors]
Suppose $p=2$, $|X_1|^2/n=|X_2|^2/n=1$, and $X_1^\top X_2/n=\rho$. Assume both elastic net coefficients are nonzero and have the same sign, and write $s=\operatorname{sgn}(\hat{\beta}_1)=\operatorname{sgn}(\hat{\beta}_2)$. The active KKT equations are
\begin{align*}
\frac{1}{n}X_1^\top(Y-X\hat{\beta})=\lambda_1s+\lambda_2\hat{\beta}_1
\end{align*}
and
\begin{align*}
\frac{1}{n}X_2^\top(Y-X\hat{\beta})=\lambda_1s+\lambda_2\hat{\beta}_2.
\end{align*}
Subtracting the second equation from the first cancels the common $\lambda_1s$ term:
\begin{align*}
\frac{1}{n}(X_1-X_2)^\top(Y-X\hat{\beta})=\lambda_2(\hat{\beta}_1-\hat{\beta}_2).
\end{align*}
Since $X\hat{\beta}=X_1\hat{\beta}_1+X_2\hat{\beta}_2$, the fitted-value part is
\begin{align*}
\frac{1}{n}(X_1-X_2)^\top X\hat{\beta}=\frac{1}{n}(X_1-X_2)^\top X_1\hat{\beta}_1+\frac{1}{n}(X_1-X_2)^\top X_2\hat{\beta}_2.
\end{align*}
Using $|X_1|^2/n=|X_2|^2/n=1$ and $X_1^\top X_2/n=X_2^\top X_1/n=\rho$, this becomes
\begin{align*}
\frac{1}{n}(X_1-X_2)^\top X\hat{\beta}=(1-\rho)\hat{\beta}_1+(\rho-1)\hat{\beta}_2.
\end{align*}
Equivalently,
\begin{align*}
\frac{1}{n}(X_1-X_2)^\top X\hat{\beta}=(1-\rho)(\hat{\beta}_1-\hat{\beta}_2).
\end{align*}
Substituting this expansion into the subtracted KKT equation gives
\begin{align*}
\frac{1}{n}(X_1-X_2)^\top Y-(1-\rho)(\hat{\beta}_1-\hat{\beta}_2)=\lambda_2(\hat{\beta}_1-\hat{\beta}_2).
\end{align*}
Moving the coefficient-difference terms to the same side yields
\begin{align*}
(1-\rho+\lambda_2)(\hat{\beta}_1-\hat{\beta}_2)=\frac{1}{n}(X_1-X_2)^\top Y.
\end{align*}
Thus the split between two same-sign active coefficients is controlled by the denominator $1-\rho+\lambda_2$; the ridge term prevents this denominator from collapsing when $\rho$ is close to $1$.
[/example]
This two-variable computation suggests a precise grouping phenomenon. The instability to control is the arbitrary splitting of fitted signal between nearly duplicated predictors: with a pure Lasso penalty, two highly correlated columns can give nearly the same fitted values while receiving quite different coefficients. The ridge part of the elastic net adds curvature in the coefficient difference direction, so same-sign selected predictors with high correlation are forced to track one another.
[quotetheorem:5569]
[citeproof:5569]
The grouping effect should not be read as a variable-selection guarantee. It is a stability statement about fitted coefficients under collinearity, and each hypothesis is doing visible work. The same-sign assumption makes the two $\lambda_1$ sign terms cancel; if the fitted coefficients have opposite signs, subtracting the KKT equations doubles the sign contribution instead of removing it, so there is no reason for the two coefficient values to be close. The condition $\lambda_2>0$ prevents the denominator from degenerating when $\rho\to 1$; with pure Lasso and nearly duplicated predictors, the fitted value can be stable while the split between the two coefficients is unstable. If $\rho$ is not close to $1$, the theorem still gives a valid bound, but it no longer expresses a strong grouping phenomenon because the predictors are not nearly interchangeable. In applications with duplicated or nearly duplicated features, the elastic net behaviour is often preferable to the Lasso's tendency to select one representative from a correlated cluster.
[example: Correlated Polynomial Features]
Let $x_1$ be the centred and normalised column formed from $t$, and let $x_3$ be the centred and normalised column formed from $t^3$, so $|x_1|^2/n=|x_3|^2/n=1$. Their empirical correlation is
\begin{align*}
\rho
=
\frac{x_1^\top x_3}{n}
=
\frac{\sum_{i=1}^n (t_i-\bar t)(t_i^3-\overline{t^3})}
{\left(\sum_{i=1}^n (t_i-\bar t)^2\right)^{1/2}
\left(\sum_{i=1}^n (t_i^3-\overline{t^3})^2\right)^{1/2}}.
\end{align*}
If the observed $t_i$ values lie in a narrow region where the centred cubic column is close to a positive scalar multiple of the centred linear column, then $\rho$ is close to $1$. For example, if $t_i\in\{-a,a\}$ with both signs represented equally, then $\bar t=0$, $\overline{t^3}=0$, and $t_i^3=a^2t_i$, so the two centred columns become identical after normalisation and $\rho=1$.
Now compare the same-sign active coefficients on $x_1$ and $x_3$ in an elastic net fit. Writing them as $\hat\beta_1$ and $\hat\beta_3$, the two-predictor KKT subtraction gives
\begin{align*}
(1-\rho+\lambda_2)(\hat\beta_1-\hat\beta_3)
=
\frac{1}{n}(x_1-x_3)^\top Y.
\end{align*}
Hence
\begin{align*}
\hat\beta_1-\hat\beta_3
=
\frac{(x_1-x_3)^\top Y/n}{1-\rho+\lambda_2}.
\end{align*}
For the pure Lasso, $\lambda_2=0$, so the denominator is $1-\rho$; when $\rho$ is close to $1$, a small change in $(x_1-x_3)^\top Y/n$ can produce a large change in the split between $\hat\beta_1$ and $\hat\beta_3$. With $\lambda_2>0$, the denominator is $1-\rho+\lambda_2$, so the ridge part keeps the coefficient split from being governed only by the nearly flat direction $x_1-x_3$. Thus an elastic net fit can keep both polynomial terms active and stabilise their combined fitted contribution, which is why polynomial and spline dictionaries often benefit from a ridge component.
[/example]
## Adaptive Lasso and Weighted Coordinate Penalties
The Lasso penalises all coordinates at the same rate, which is a poor match for situations where preliminary evidence suggests that some coordinates are likely to carry signal and others are likely to be noise. Equal penalisation also creates persistent shrinkage bias for large coefficients. The adaptive Lasso modifies the $\ell_1$ penalty by attaching data-dependent weights to the coordinates.
[definition: Weighted Lasso]
Let $w = (w_1,\dots,w_p) \in (0,\infty)^p$ and $\lambda \ge 0$. A weighted Lasso estimator is any solution
\begin{align*}
\hat{\beta}^{w} \in \operatorname*{arg\,min}_{\beta \in \mathbb R^p}
\left\{\frac{1}{2n}|Y-X\beta|^2 + \lambda\sum_{j=1}^p w_j|\beta_j|\right\}.
\end{align*}
[/definition]
The weighted problem is still convex when the weights are treated as fixed. Coordinates with large $w_j$ are penalised more strongly; coordinates with small $w_j$ are penalised less strongly.
[quotetheorem:5570]
[citeproof:5570]
The KKT equations show that the inactive threshold is no longer common across coordinates: coordinate $j$ faces threshold $\lambda w_j$. The assumptions $w_j>0$ and $\lambda\ge 0$ are essential for this to remain a convex weighted shrinkage problem. If some $w_j=0$, coordinate $j$ is unpenalised and can enter freely, so the KKT inequality no longer imposes a selection threshold on that coordinate. If a weight is infinite, the finite-dimensional optimisation problem is no longer the stated convex problem unless one imposes the extra constraint $\beta_j=0$. Negative weights would reward large coefficients and can make the objective unbounded below in directions weakly controlled by the loss. Thus weighted Lasso is best understood as ordinary Lasso after rescaling the coordinate thresholds, and this raises the central design question for adaptive regularisation: how should the weights be chosen so that likely signals are protected while likely noise coordinates are suppressed?
[definition: Adaptive Lasso]
Let $\tilde{\beta} \in \mathbb R^p$ be an initial estimator and let $\gamma > 0$. For a small stabilisation constant $a_n \ge 0$, define
\begin{align*}
w_j = (|\tilde{\beta}_j| + a_n)^{-\gamma}, \qquad j=1,\dots,p.
\end{align*}
The adaptive Lasso estimator is the weighted Lasso estimator with these weights.
[/definition]
The stabilisation constant avoids infinite weights when a pilot coefficient is exactly zero. In low-dimensional theory it is often set to $0$ under a root-$n$ consistent pilot estimator with nonzero true active coefficients; in high-dimensional settings it is safer to keep $a_n>0$ unless the pilot is known to preserve the relevant support.
[example: Weighted Penalties from Pilot Estimates]
In a regression with $p$ moderate relative to $n$, take $\tilde{\beta}$ to be the ridge pilot and choose
\begin{align*}
w_j=(|\tilde{\beta}_j|+n^{-1/2})^{-1}.
\end{align*}
Compare two coordinates with pilot magnitudes $0.4$ and $0.01$. Their weights are
\begin{align*}
w_{\mathrm{large}}=\frac{1}{0.4+n^{-1/2}},
\qquad
w_{\mathrm{small}}=\frac{1}{0.01+n^{-1/2}}.
\end{align*}
Since $0.01+n^{-1/2}<0.4+n^{-1/2}$, taking reciprocals of positive numbers gives
\begin{align*}
w_{\mathrm{small}}>w_{\mathrm{large}}.
\end{align*}
The size of the separation is
\begin{align*}
\frac{w_{\mathrm{small}}}{w_{\mathrm{large}}}
=
\frac{1/(0.01+n^{-1/2})}{1/(0.4+n^{-1/2})}
=
\frac{0.4+n^{-1/2}}{0.01+n^{-1/2}}.
\end{align*}
Thus, as $n$ grows and $n^{-1/2}$ becomes small, this ratio approaches
\begin{align*}
\frac{0.4}{0.01}=40.
\end{align*}
In the weighted Lasso objective, coordinate $j$ is penalised through the term
\begin{align*}
\lambda w_j|\beta_j|.
\end{align*}
Therefore the coordinate with pilot magnitude $0.01$ receives the larger effective penalty $\lambda w_{\mathrm{small}}$, while the coordinate with pilot magnitude $0.4$ receives the smaller effective penalty $\lambda w_{\mathrm{large}}$. The pilot estimator is used only to set these fixed weights; after the weights are chosen, the adaptive fit is still the convex optimisation problem
\begin{align*}
\operatorname*{minimise}_{\beta\in\mathbb R^p}
\left\{
\frac{1}{2n}|Y-X\beta|^2
+
\lambda\sum_{j=1}^p w_j|\beta_j|
\right\}.
\end{align*}
Larger pilot coefficients are therefore protected from shrinkage more than smaller pilot coefficients, while the final selected coefficients are determined by the second-stage penalised least-squares fit.
[/example]
The example illustrates the intended separation between large and small pilot coefficients. The next theorem explains why, under an asymptotic regime where this separation is reliable, the adaptive Lasso can remove inactive variables while leaving the active variables with the same first-order distribution as least squares on the true support.
[quotetheorem:5571]
[citeproof:5571]
This theorem is a low-dimensional oracle statement, and its hypotheses describe exactly where the oracle conclusion can fail. Fixed $p$ keeps the random score vector at the usual $n^{-1/2}$ scale; when $p$ grows, the largest inactive score is typically larger and the same tuning rates no longer control all noise variables at once. Root-$n$ pilot consistency separates active and inactive coordinates: if an active coefficient is estimated as nearly zero by the pilot, its adaptive weight becomes large and the second stage can delete a real signal. The nonzero-coefficient assumption inside $S$ is also a beta-separation condition in disguise, since coefficients drifting to zero cannot be distinguished from inactive coordinates at the required scale. The two tuning conditions pull in opposite directions: $\lambda_n\sqrt n\to 0$ removes first-order bias on $S$, while $\lambda_n n^{(1+\gamma)/2}\to\infty$ makes the inactive penalty dominate the noise score. If the first condition fails, the limiting distribution is biased; if the second fails, false positives need not disappear.
Weighted penalties can reduce bias on a candidate active set only when the pilot estimator ranks active and inactive coordinates well enough. If the pilot misses a real signal and assigns it a large weight, the adaptive stage can remove it permanently.
[remark: High-Dimensional Adaptive Lasso]
When $p$ grows with $n$, the same idea requires stronger assumptions: a pilot estimator with suitable sup-norm control, beta-min conditions separating active coefficients from zero, and restricted eigenvalue or compatibility conditions for the weighted design. The adaptive Lasso is therefore best understood as a second-stage procedure whose success depends on the quality of the first stage.
[/remark]
## Group Lasso and Sparse-Group Penalties
Many designs arrive with structure before any data are observed. Dummy variables from one categorical predictor form a block, basis functions for one covariate form a block, and interactions may be organised by factor or by hierarchy. Coordinatewise sparsity ignores this structure, so the next question is how to select or remove whole groups while still using convex optimisation.
[definition: Group Partition]
A group partition of $\{1,\dots,p\}$ is a collection $\mathcal G = \{G_1,\dots,G_M\}$ of nonempty, pairwise disjoint subsets of $\{1,\dots,p\}$ such that
\begin{align*}
\bigcup_{m=1}^M G_m = \{1,\dots,p\}.
\end{align*}
For $G \in \mathcal G$, write $\beta_G = (\beta_j)_{j\in G}$ and $X_G$ for the submatrix of $X$ with columns indexed by $G$.
[/definition]
A partition tells us which coordinates should be treated as a single modelling unit, but it does not yet impose selection. To select whole units, the penalty must have a corner when an entire vector $\beta_G$ is zero, which leads to a sum of Euclidean norms over groups.
[definition: Group Lasso Estimator]
Let $\mathcal G$ be a group partition and let $a_G>0$ be group weights. The group Lasso estimator is any solution
\begin{align*}
\hat{\beta}^{\mathrm{GL}} \in \operatorname*{arg\,min}_{\beta\in\mathbb R^p}
\left\{\frac{1}{2n}|Y-X\beta|^2 + \lambda\sum_{G\in\mathcal G} a_G |\beta_G|\right\}.
\end{align*}
[/definition]
The usual choice $a_G=\sqrt{|G|}$ compensates for the fact that larger groups have larger Euclidean norms under noise. To understand how this penalty selects a block, we again turn to KKT conditions, now using the subdifferential of the Euclidean norm at the origin.
[quotetheorem:5572]
[citeproof:5572]
The KKT condition gives the selection rule: an inactive group remains inactive when its residual score vector has Euclidean norm at most $\lambda a_G$. This is the block analogue of the Lasso residual-correlation threshold, but the hypotheses determine what the block statement means. The partition assumption ensures that each coefficient is penalised exactly once; with overlapping groups, the same coefficient can be charged through several norms and the KKT equations require extra latent variables or duplicated-coordinate formulations. The positivity of $a_G$ prevents a group from being accidentally unpenalised: if $a_G=0$, the corresponding block has no selection threshold, while a negative weight would destroy convexity of the penalty. Thus the theorem is a clean block-selection result for disjoint groups with positive weights, and it prepares the way for sparse-group penalties, where the model needs both group-level deletion and coordinate-level deletion inside active groups.
[example: Grouped Categorical Predictors]
Suppose the factor has levels $0,1,2,3$, with level $0$ chosen as the baseline. Define three dummy columns by
\begin{align*}
D_{i1}=\mathbf 1\{\text{observation }i\text{ has level }1\},\qquad
D_{i2}=\mathbf 1\{\text{observation }i\text{ has level }2\},\qquad
D_{i3}=\mathbf 1\{\text{observation }i\text{ has level }3\}.
\end{align*}
For the grouped coefficient vector $\beta_G=(\beta_1,\beta_2,\beta_3)$, the factor contribution to the fitted value for observation $i$ is
\begin{align*}
D_{i1}\beta_1+D_{i2}\beta_2+D_{i3}\beta_3.
\end{align*}
If observation $i$ has the baseline level $0$, then $D_{i1}=D_{i2}=D_{i3}=0$, so the contribution is
\begin{align*}
0\cdot \beta_1+0\cdot \beta_2+0\cdot \beta_3=0.
\end{align*}
If observation $i$ has level $1$, then $(D_{i1},D_{i2},D_{i3})=(1,0,0)$, so the contribution is
\begin{align*}
1\cdot\beta_1+0\cdot\beta_2+0\cdot\beta_3=\beta_1.
\end{align*}
Similarly, level $2$ gives $\beta_2$ and level $3$ gives $\beta_3$. Thus the three coefficients are exactly the fitted contrasts of levels $1,2,3$ against the baseline level.
Treat these three dummy variables as one group $G=\{1,2,3\}$. In the group Lasso objective, the factor is penalised through the single term
\begin{align*}
\lambda a_G|\beta_G|
=
\lambda a_G(\beta_1^2+\beta_2^2+\beta_3^2)^{1/2}.
\end{align*}
This term is zero precisely when
\begin{align*}
(\beta_1^2+\beta_2^2+\beta_3^2)^{1/2}=0,
\end{align*}
which is equivalent to
\begin{align*}
\beta_1^2+\beta_2^2+\beta_3^2=0.
\end{align*}
Since each square is nonnegative, this happens exactly when
\begin{align*}
\beta_1=\beta_2=\beta_3=0.
\end{align*}
Therefore, if the group is inactive, every contrast with the baseline is zero; if the group is active, the fitted coefficients describe the nonzero level contrasts together. The penalty matches the modelling unit: the categorical predictor enters or leaves as a factor, rather than selecting one level contrast while discarding the rest without a factor-level reason.
[/example]
The categorical example is well matched to pure group selection because a factor often enters or leaves as a unit. Basis expansions and interaction dictionaries can be different: the group may be meaningful, while only a few coordinates inside an active group are needed. This motivates adding coordinate sparsity back into the group penalty.
[definition: Sparse-Group Lasso Estimator]
Let $\mathcal G$ be a group partition, let $a_G>0$, and let $\alpha\in[0,1]$. The sparse-group Lasso estimator is any solution
\begin{align*}
\hat{\beta}^{\mathrm{SGL}} \in \operatorname*{arg\,min}_{\beta\in\mathbb R^p}
\left\{\frac{1}{2n}|Y-X\beta|^2
+ \lambda\left((1-\alpha)\sum_{G\in\mathcal G}a_G|\beta_G| + \alpha\|\beta\|_1\right)\right\}.
\end{align*}
[/definition]
At $\alpha=0$ this is the group Lasso, and at $\alpha=1$ it is the ordinary Lasso. Intermediate values allow whole groups to disappear while also allowing only a subset of coordinates inside an active group to survive.
[example: Sparse Polynomial Dictionary by Variable]
For a raw covariate $u_m$, write the centred and normalised polynomial columns as
\begin{align*}
x_{m1},\qquad x_{m2},\qquad x_{m3},
\end{align*}
corresponding to $u_m$, $u_m^2$, and $u_m^3$, and put their coefficients in the group
\begin{align*}
\beta_{G_m}=(\beta_{m1},\beta_{m2},\beta_{m3}).
\end{align*}
The fitted contribution of this variable is
\begin{align*}
X_{G_m}\beta_{G_m}
=
x_{m1}\beta_{m1}+x_{m2}\beta_{m2}+x_{m3}\beta_{m3}.
\end{align*}
Its contribution to the sparse-group penalty is
\begin{align*}
\lambda\left((1-\alpha)a_{G_m}|\beta_{G_m}|+\alpha\|\beta_{G_m}\|_1\right)
=
\lambda\left((1-\alpha)a_{G_m}(\beta_{m1}^2+\beta_{m2}^2+\beta_{m3}^2)^{1/2}
+\alpha(|\beta_{m1}|+|\beta_{m2}|+|\beta_{m3}|)\right).
\end{align*}
If $u_m$ is removed as a whole block, then
\begin{align*}
\beta_{m1}=\beta_{m2}=\beta_{m3}=0,
\end{align*}
so
\begin{align*}
X_{G_m}\beta_{G_m}
=
x_{m1}\cdot 0+x_{m2}\cdot 0+x_{m3}\cdot 0
=
0,
\end{align*}
and the penalty contribution is
\begin{align*}
\lambda\left((1-\alpha)a_{G_m}(0^2+0^2+0^2)^{1/2}
+\alpha(|0|+|0|+|0|)\right)
=
\lambda\left((1-\alpha)a_{G_m}\cdot 0+\alpha\cdot 0\right)
=
0.
\end{align*}
If the variable is relevant only through its linear and quadratic terms, the fit may instead have
\begin{align*}
\beta_{m1}\ne 0,\qquad \beta_{m2}\ne 0,\qquad \beta_{m3}=0.
\end{align*}
Then the fitted contribution becomes
\begin{align*}
X_{G_m}\beta_{G_m}
=
x_{m1}\beta_{m1}+x_{m2}\beta_{m2}+x_{m3}\cdot 0
=
x_{m1}\beta_{m1}+x_{m2}\beta_{m2},
\end{align*}
while the penalty contribution is
\begin{align*}
\lambda\left((1-\alpha)a_{G_m}(\beta_{m1}^2+\beta_{m2}^2+0^2)^{1/2}
+\alpha(|\beta_{m1}|+|\beta_{m2}|+|0|)\right)
=
\lambda\left((1-\alpha)a_{G_m}(\beta_{m1}^2+\beta_{m2}^2)^{1/2}
+\alpha(|\beta_{m1}|+|\beta_{m2}|)\right).
\end{align*}
Thus the group norm can delete all three polynomial features together, while the $\ell_1$ part still has coordinate-level terms that allow the cubic coefficient to be zero inside an active polynomial block.
[/example]
Structured penalties are not limited to partitions. Overlapping groups, hierarchical penalties, fused penalties, and graph-guided penalties encode different prior beliefs about which coefficients should enter together or differ smoothly. The common theme is that the penalty changes the shape of the feasible geometry so that the estimator favours structured low-complexity patterns rather than unstructured coordinate sparsity.
[remark: Choosing a Structured Penalty]
The structure should come from the design or the scientific question, not from looking at the response and then inventing groups. Examples include factor levels, basis expansions, spatial neighbourhoods, time ordering, and known interaction hierarchies. Once the structure is chosen, the tuning parameter still controls the strength of regularisation and should be selected using validation, information criteria, or theory appropriate to the goal.
[/remark]
## Connections and Tradeoffs
The elastic net, adaptive Lasso, and group penalties all modify the Lasso by changing what it means for a coefficient vector to be simple. Elastic net simplicity combines small $\ell_1$ norm with small Euclidean norm; adaptive Lasso simplicity is coordinate-dependent; group Lasso simplicity is block sparsity. These modifications preserve convexity in the formulations discussed here, which is why KKT analysis and efficient algorithms remain available.
The price is that each method introduces extra modelling choices. Elastic net requires balancing $\lambda_1$ and $\lambda_2$, adaptive Lasso requires a pilot estimator and a weight exponent, and group methods require a group structure and group weights. These choices are not cosmetic: they determine the geometry of shrinkage and therefore the estimator's bias, stability, and selection behaviour.
[explanation: How This Chapter Fits the Course]
Earlier chapters developed the Lasso as the basic sparse estimator and analysed its KKT conditions, prediction bounds, and support recovery conditions. This chapter shows how the same convex-analytic machinery extends to penalties designed for collinearity, unequal coordinate importance, and grouped structure. Chapter 7 next moves from penalised formulations to a constrained estimator, the Dantzig selector; Chapters 8 and 9 then return to bias reduction and inference, where the same themes of bias, variance, and identifiability reappear in sharper forms.
[/explanation]
Having seen how alternative penalties address some of the Lasso's weaknesses, the course next turns to a different constrained formulation of sparse estimation. The Dantzig selector uses the same score control and cone-based geometry, but expresses estimation through approximate normal equations rather than direct penalisation.
# 7. Dantzig Selector and Constrained Estimation
This chapter introduces the Dantzig selector as an alternative route to sparse high-dimensional regression. It uses the fixed-design linear model and score concentration from Chapter 1, the Lasso KKT and basic inequality from Chapter 2, and the restricted eigenvalue arguments from Chapters 3 and 4. The Lasso controls the residual score through its KKT equations, while the Dantzig selector builds that control directly into the feasible set and then minimises the $\ell_1$ norm. The chapter compares these two viewpoints, develops the same cone geometry behind the basic oracle bounds, and records how restricted eigenvalue assumptions transfer estimation and prediction rates between the two methods.
## The Score Constraint Behind the Dantzig Selector
The starting question is how to estimate a sparse coefficient vector when the natural first-order score
\begin{align*}
\frac{X^\top(Y-X\beta)}{n}
\end{align*}
cannot be forced to vanish because $p$ may exceed $n$. In low-dimensional least squares the normal equations set this score equal to zero. In high dimensions, we instead ask for all empirical correlations between covariates and residuals to be uniformly small.
Throughout this chapter let $Y = X\beta^* + \varepsilon$, where $X \in \mathbb R^{n \times p}$ is fixed, $Y \in \mathbb R^n$, $\beta^* \in \mathbb R^p$, and $\varepsilon \in \mathbb R^n$ is noise. Write $S = \operatorname{supp}(\beta^*)$ and $s = |S|$. We assume the columns are normalised so that
\begin{align*}
\frac{\|X_j\|_2^2}{n} \le 1, \qquad j=1,\dots,p.
\end{align*}
[definition: Dantzig Selector]
For $\lambda > 0$, the Dantzig selector is any solution
\begin{align*}
\hat\beta_{\mathrm{DS}} \in \operatorname*{argmin}_{\beta \in \mathbb R^p} \|\beta\|_1
\end{align*}
subject to
\begin{align*}
\left\|\frac{X^\top(Y-X\beta)}{n}\right\|_\infty \le \lambda.
\end{align*}
[/definition]
The constraint says that no covariate has empirical correlation larger than $\lambda$ with the fitted residual. The objective then selects the sparsest point, in the convex $\ell_1$ sense, among all coefficients that satisfy these approximate normal equations.
[remark: Feasible Set Geometry]
The Dantzig feasible set is the intersection of $2p$ affine half-spaces in $\mathbb R^p$. It is therefore a convex polyhedron, and the objective $\|\beta\|_1$ is convex but not differentiable. This makes the estimator computationally accessible and also places it in the same polyhedral geometry as basis pursuit and constrained Lasso.
[/remark]
The feasible set description raises the statistical question that must be answered before any comparison with $\beta^*$ is possible: does the true coefficient vector lie in this polyhedron with high probability? The next theorem isolates the exact score event needed for this, and it motivates the later choice of $\lambda$.
[quotetheorem:5573]
[citeproof:5573]
The score hypothesis is essential: if some column has
\begin{align*}
\frac{|X_j^\top\varepsilon|}{n}>\lambda,
\end{align*}
then the true coefficient vector fails the very constraint over which the estimator optimises. For instance, with $p=1$,
\begin{align*}
\frac{X_1^\top X_1}{n}=1
\qquad\text{and}\qquad
\frac{|X_1^\top\varepsilon|}{n}>\lambda,
\end{align*}
the true value is outside the Dantzig feasible interval. The theorem does not say that every feasible point is close to $\beta^*$, only that $\beta^*$ is available as a comparison point for the $\ell_1$ minimisation. This short result is the entry point for the whole analysis: concentration bounds from the first chapter give the event with high probability when the noise is sub-Gaussian, so the deterministic inequalities below may be read conditionally on that event.
[example: Gaussian Score Calibration]
Suppose $\varepsilon \sim \mathcal N(0,\sigma^2 I_n)$, fix $\delta\in(0,1)$, and assume $\|X_j\|_2^2/n\le 1$ for every column. For a fixed coordinate $j$,
\begin{align*}
X_j^\top\varepsilon=\sum_{i=1}^n X_{ij}\varepsilon_i.
\end{align*}
A linear combination of jointly Gaussian variables is Gaussian, so $X_j^\top\varepsilon$ is Gaussian. Its mean is
\begin{align*}
\mathbb E[X_j^\top\varepsilon]=\sum_{i=1}^n X_{ij}\mathbb E[\varepsilon_i]=0.
\end{align*}
Because the coordinates of $\varepsilon$ are independent and each has variance $\sigma^2$,
\begin{align*}
\operatorname{Var}(X_j^\top\varepsilon)=\sum_{i=1}^n X_{ij}^2\operatorname{Var}(\varepsilon_i)=\sigma^2\sum_{i=1}^n X_{ij}^2=\sigma^2\|X_j\|_2^2.
\end{align*}
Therefore
\begin{align*}
\frac{X_j^\top\varepsilon}{n}\sim \mathcal N\left(0,\frac{\sigma^2\|X_j\|_2^2}{n^2}\right).
\end{align*}
The column normalisation gives
\begin{align*}
\frac{\sigma^2\|X_j\|_2^2}{n^2}\le \frac{\sigma^2}{n}.
\end{align*}
Set
\begin{align*}
t=\sigma\sqrt{\frac{2\log(2p/\delta)}{n}}.
\end{align*}
For each fixed $j$, the centered Gaussian tail bound with variance at most $\sigma^2/n$ gives
\begin{align*}
\mathbb P\left(\left|\frac{X_j^\top\varepsilon}{n}\right|>t\right)\le 2\exp\left(-\frac{nt^2}{2\sigma^2}\right).
\end{align*}
Substituting the value of $t$,
\begin{align*}
\frac{nt^2}{2\sigma^2}=\frac{n}{2\sigma^2}\cdot \sigma^2\frac{2\log(2p/\delta)}{n}=\log(2p/\delta).
\end{align*}
Hence
\begin{align*}
2\exp\left(-\frac{nt^2}{2\sigma^2}\right)=2\exp(-\log(2p/\delta))=2\cdot\frac{\delta}{2p}=\frac{\delta}{p}.
\end{align*}
Thus, for every $j$,
\begin{align*}
\mathbb P\left(\left|\frac{X_j^\top\varepsilon}{n}\right|>t\right)\le \frac{\delta}{p}.
\end{align*}
Applying the union bound over the $p$ coordinates,
\begin{align*}
\mathbb P\left(\left\|\frac{X^\top\varepsilon}{n}\right\|_\infty>t\right)\le \sum_{j=1}^p \mathbb P\left(\left|\frac{X_j^\top\varepsilon}{n}\right|>t\right).
\end{align*}
Using the coordinate bound,
\begin{align*}
\sum_{j=1}^p \mathbb P\left(\left|\frac{X_j^\top\varepsilon}{n}\right|>t\right)\le \sum_{j=1}^p \frac{\delta}{p}=\delta.
\end{align*}
Equivalently,
\begin{align*}
\left\|\frac{X^\top\varepsilon}{n}\right\|_\infty \le \sigma\sqrt{\frac{2\log(2p/\delta)}{n}}
\end{align*}
with probability at least $1-\delta$. On this event, the feasibility theorem applies, so choosing $\lambda$ at least this large makes $\beta^*$ feasible for the Dantzig selector with probability at least $1-\delta$.
[/example]
The optimisation can also be written in a form that standard linear programming solvers accept. This matters because the Dantzig selector was originally attractive not only for its theory but also for its direct constrained formulation.
[example: Linear Programming Formulation]
Introduce auxiliary variables $u_1,\dots,u_p$ and impose
\begin{align*}
u_j \ge 0,\qquad -u_j \le \beta_j \le u_j,\qquad j=1,\dots,p.
\end{align*}
For each coordinate, these two inequalities imply $\beta_j\le u_j$ and $-\beta_j\le u_j$, hence $|\beta_j|\le u_j$. Therefore
\begin{align*}
\|\beta\|_1=\sum_{j=1}^p |\beta_j|\le \sum_{j=1}^p u_j.
\end{align*}
Conversely, for any fixed $\beta$, choosing $u_j=|\beta_j|$ satisfies $u_j\ge 0$ and
\begin{align*}
-| \beta_j| \le \beta_j \le |\beta_j|,
\end{align*}
so the smallest possible value of $\sum_{j=1}^p u_j$ over admissible $u$ is exactly $\sum_{j=1}^p|\beta_j|=\|\beta\|_1$.
Now rewrite the Dantzig score constraint by expanding the residual:
\begin{align*}
\frac{X^\top(Y-X\beta)}{n}
=
\frac{X^\top Y}{n}-\frac{X^\top X}{n}\beta.
\end{align*}
Thus
\begin{align*}
\left\|\frac{X^\top(Y-X\beta)}{n}\right\|_\infty \le \lambda
\end{align*}
is equivalent, coordinate by coordinate, to
\begin{align*}
-\lambda \le \left(\frac{X^\top Y}{n}-\frac{X^\top X}{n}\beta\right)_j \le \lambda,
\qquad j=1,\dots,p,
\end{align*}
or in vector form,
\begin{align*}
-\lambda \mathbf 1 \le \frac{X^\top Y}{n}-\frac{X^\top X}{n}\beta \le \lambda \mathbf 1.
\end{align*}
The Dantzig selector is therefore equivalently obtained by minimising
\begin{align*}
\sum_{j=1}^p u_j
\end{align*}
over $(\beta,u)\in\mathbb R^{2p}$ subject to
\begin{align*}
u_j\ge 0,\qquad -u_j\le \beta_j\le u_j,\qquad j=1,\dots,p,
\end{align*}
and
\begin{align*}
-\lambda \mathbf 1 \le \frac{X^\top Y}{n}-\frac{X^\top X}{n}\beta \le \lambda \mathbf 1.
\end{align*}
The objective is linear in $u$, and every displayed constraint is affine in $(\beta,u)$, so the Dantzig selector can be written as a linear program.
[/example]
## Cone Geometry and Deterministic Error Bounds
The next question is what feasibility and $\ell_1$ minimality imply about the error vector $\Delta = \hat\beta_{\mathrm{DS}}-\beta^*$. The proof follows the same broad outline as the Lasso proof: first show that $\Delta$ belongs to a cone determined by the true support, then use a restricted eigenvalue assumption to convert a score inequality into prediction and estimation bounds.
[definition: Dantzig Cone]
Let $S \subseteq \{1,\dots,p\}$. The Dantzig cone associated with $S$ is
\begin{align*}
\mathcal C(S) = \{\Delta \in \mathbb R^p : \|\Delta_{S^c}\|_1 \le \|\Delta_S\|_1\}.
\end{align*}
[/definition]
This cone is slightly sharper than the common Lasso cone with constant $3$, because the Dantzig selector minimises $\ell_1$ over a feasible set containing $\beta^*$ rather than trading the empirical loss against an $\ell_1$ penalty. The next lemma is the algebraic reason.
[quotetheorem:5574]
[citeproof:5574]
Both hypotheses do work here. If $\beta^*$ were not feasible, the comparison $\|\hat\beta_{\mathrm{DS}}\|_1\le \|\beta^*\|_1$ would have no reason to hold; if $S$ did not contain the support of $\beta^*$, then the off-support term would include genuine signal coordinates and the displayed split would fail. For example, take $\beta^*=(1,2,0,\dots,0)$ but set $S=\{1\}$; then even the zero estimator gives $\Delta=(-1,-2,0,\dots,0)$, so $\|\Delta_{S^c}\|_1=2>\|\Delta_S\|_1=1$ before any design information is used. The theorem does not say that $\Delta$ is small, since a vector may lie in the cone and still be nearly invisible to the design. This motivates a restricted eigenvalue assumption, whose role is to rule out sparse perturbations hidden in the null space of $X$.
[definition: Restricted Eigenvalue Constant]
The restricted eigenvalue constant is the map
\begin{align*}
\kappa:\{(S,c):S\subseteq\{1,\dots,p\},\ c>0\}\longrightarrow [0,\infty)
\end{align*}
defined by
\begin{align*}
\kappa(S,c) = \inf\left\{ \frac{\|X\Delta\|_2}{\sqrt n\|\Delta_S\|_2} : \Delta \ne 0,\ \|\Delta_{S^c}\|_1 \le c\|\Delta_S\|_1 \right\}.
\end{align*}
[/definition]
When $\kappa(S,c)>0$, vectors in the cone cannot hide in the null space of $X$. A concrete failure occurs if two columns are identical: for $S=\{1\}$, the vector $\Delta=(-1,1,0,\dots,0)$ satisfies $\|\Delta_{S^c}\|_1=\|\Delta_S\|_1$ but $X\Delta=0$, so $\kappa(S,1)=0$. The next oracle inequality combines this curvature with the Dantzig score constraint to produce the prediction and $\ell_1$ estimation rates that will later be compared with the Lasso.
[quotetheorem:5575]
[citeproof:5575]
The restricted eigenvalue hypothesis is essential: with duplicated columns, a non-zero cone vector can satisfy $X\Delta=0$, and no prediction inequality can force an $\ell_1$ error bound. Feasibility is separately essential for this oracle inequality, not merely for the earlier cone statement: it is used twice, first to compare $\|\hat\beta_{\mathrm{DS}}\|_1$ with $\|\beta^*\|_1$, and then to bound $\|X^\top X\Delta/n\|_\infty$ by $2\lambda$ from the two score constraints. The theorem deliberately does not assert a full $\ell_2$ bound, because the displayed restricted eigenvalue controls $\|\Delta_S\|_2$ through prediction error but does not by itself control all of $\|\Delta\|_2$. The constants in this displayed version are less important than the dependence on $\lambda$, $s$, and the restricted eigenvalue. With $\lambda \asymp \sigma\sqrt{\log p/n}$, the prediction rate becomes $s\sigma^2\log p/n$ up to constants depending on the restricted eigenvalue.
[example: Orthogonal Design]
Assume $p\le n$ and
\begin{align*}
\frac{X^\top X}{n}=I_p.
\end{align*}
Write
\begin{align*}
z=\frac{X^\top Y}{n}.
\end{align*}
For any $\beta\in\mathbb R^p$,
\begin{align*}
\frac{X^\top(Y-X\beta)}{n}
=
\frac{X^\top Y}{n}-\frac{X^\top X}{n}\beta.
\end{align*}
Using $z=X^\top Y/n$ and $X^\top X/n=I_p$, this becomes
\begin{align*}
\frac{X^\top(Y-X\beta)}{n}=z-\beta.
\end{align*}
Thus the Dantzig constraint is
\begin{align*}
\|z-\beta\|_\infty\le \lambda,
\end{align*}
which is equivalent to the coordinatewise conditions
\begin{align*}
|z_j-\beta_j|\le \lambda,\qquad j=1,\dots,p.
\end{align*}
The objective and the constraints now separate by coordinate, since
\begin{align*}
\|\beta\|_1=\sum_{j=1}^p|\beta_j|
\end{align*}
and the $j$th constraint only involves $\beta_j$. For a fixed $j$, the feasible set is the interval
\begin{align*}
z_j-\lambda\le \beta_j\le z_j+\lambda.
\end{align*}
If $z_j>\lambda$, then every feasible $\beta_j$ is positive or zero with the smallest absolute value at
\begin{align*}
\beta_j=z_j-\lambda.
\end{align*}
If $z_j<-\lambda$, then every feasible $\beta_j$ is negative or zero with the smallest absolute value at
\begin{align*}
\beta_j=z_j+\lambda.
\end{align*}
If $|z_j|\le \lambda$, then $0$ belongs to the interval, so the minimum of $|\beta_j|$ is attained at
\begin{align*}
\beta_j=0.
\end{align*}
Therefore the Dantzig selector is coordinatewise soft-thresholding:
\begin{align*}
\hat\beta_{\mathrm{DS},j}=\operatorname{sign}(z_j)(|z_j|-\lambda)_+,\qquad z_j=\frac{X_j^\top Y}{n}.
\end{align*}
The Lasso has the same coordinate form in this orthogonal case. Expanding its objective gives
\begin{align*}
\frac{1}{2n}\|Y-X\beta\|_2^2+\lambda\|\beta\|_1
=
\frac{Y^\top Y}{2n}-\frac{Y^\top X\beta}{n}+\frac{\beta^\top X^\top X\beta}{2n}+\lambda\sum_{j=1}^p|\beta_j|.
\end{align*}
Substituting $z=X^\top Y/n$ and $X^\top X/n=I_p$ gives
\begin{align*}
\frac{1}{2n}\|Y-X\beta\|_2^2+\lambda\|\beta\|_1
=
\frac{Y^\top Y}{2n}+\sum_{j=1}^p\left(\frac{1}{2}\beta_j^2-z_j\beta_j+\lambda|\beta_j|\right).
\end{align*}
The constant $Y^\top Y/(2n)$ does not affect the minimiser. For $\beta_j>0$, the derivative of the $j$th coordinate term is $\beta_j-z_j+\lambda$, so the positive critical point is $\beta_j=z_j-\lambda$, valid exactly when $z_j>\lambda$. For $\beta_j<0$, the derivative is $\beta_j-z_j-\lambda$, so the negative critical point is $\beta_j=z_j+\lambda$, valid exactly when $z_j<-\lambda$. At $\beta_j=0$, the subgradient condition is
\begin{align*}
0\in -z_j+\lambda[-1,1],
\end{align*}
which is equivalent to
\begin{align*}
|z_j|\le \lambda.
\end{align*}
Hence
\begin{align*}
\hat\beta_{\mathrm{L},j}=\operatorname{sign}(z_j)(|z_j|-\lambda)_+=\hat\beta_{\mathrm{DS},j}.
\end{align*}
In an orthogonal design, the Dantzig selector and the Lasso both reduce to coordinatewise soft-thresholding of $X^\top Y/n$; their differences arise from correlations among the columns of $X$.
[/example]
## Relationship with the Lasso
The Dantzig selector and the Lasso ask different optimisation questions, but both control the same empirical score. The Lasso obtains score control as a consequence of the KKT conditions, while the Dantzig selector imposes score control before minimising the $\ell_1$ norm. This section makes that comparison precise enough to transfer rates.
[definition: Lasso Estimator]
For $\lambda_L>0$, the Lasso estimator is any solution
\begin{align*}
\hat\beta_{\mathrm{L}} \in \operatorname*{argmin}_{\beta\in\mathbb R^p}\left\{\frac{1}{2n}\|Y-X\beta\|_2^2+\lambda_L\|\beta\|_1\right\}.
\end{align*}
[/definition]
The KKT conditions say that the Lasso residual score belongs to the subdifferential of the $\ell_1$ norm scaled by $\lambda_L$. Thus every Lasso solution is feasible for a Dantzig problem at the same score radius.
[quotetheorem:5576]
[citeproof:5576]
The optimisation hypothesis is necessary in three separate ways. First, the residual-score bound is an optimality statement, not a statement about arbitrary trial vectors: a coefficient vector that is not a Lasso minimiser need not satisfy any residual-score bound, even in one dimension. For example, if
\begin{align*}
\frac{X_1^\top X_1}{n}=1
\end{align*}
and $Y$ is fixed, taking a very large trial value for $\beta$ can make
\begin{align*}
\frac{|X_1^\top(Y-X\beta)|}{n}
\end{align*}
exceed $\lambda_L$. Second, the theorem uses existence of an actual minimiser; the conclusion is attached to a point where the infimum is attained, not to a merely improving sequence of objective values. Third, the radius is the same tuning parameter that appears in the penalty: if the Lasso is fitted with $\lambda_L$, the KKT equation gives a Dantzig score bound at radius $\lambda_L$, but it gives no bound at a smaller radius. The convex analytic structure is what permits this exact conclusion. The squared loss is convex and continuous, the $\ell_1$ penalty is convex, and for $\lambda_L>0$ the objective is coercive, so a minimiser exists; at such a minimiser the zero-subgradient condition is necessary and sufficient. The theorem does not say that a Dantzig solution is a Lasso solution, only that every Lasso optimum lies inside the corresponding Dantzig score-feasible region. This explains the phrase score constraint: the Dantzig selector begins with the inequality that the Lasso satisfies at optimum, while the remaining difference is the optimisation criterion inside that region.
[remark: Dual Geometry]
The Lasso penalty is the support function of the $\ell_\infty$ unit ball, and its KKT equations match residual correlations to faces of that ball. The Dantzig selector instead cuts coefficient space by an $\ell_\infty$ bound on the score and minimises the $\ell_1$ gauge over the resulting polyhedron. Both estimators are governed by the duality between $\ell_1$ and $\ell_\infty$.
[/remark]
In correlated designs, the two estimators may select different points even with matched tuning parameters. A small calculation makes the distinction visible.
[example: Correlated Design]
Let $p=2$ and write
\begin{align*}
G=\frac{X^\top X}{n},
\qquad
G_{11}=G_{22}=1,
\qquad
G_{12}=G_{21}=\rho,
\qquad
0<\rho<1,
\end{align*}
with
\begin{align*}
\frac{X^\top Y}{n}=(a,a),
\qquad
a>\lambda.
\end{align*}
The data and Gram matrix are invariant under swapping the two coordinates. If $\beta$ is Dantzig-feasible, then the swapped vector is also feasible, and their average is feasible by convexity of the score constraint. The $\ell_1$ norm of the average is no larger than the average of the two $\ell_1$ norms, so a Dantzig solution may be chosen on the diagonal $\beta_1=\beta_2=t$.
For $\beta=(t,t)$, the two coordinates of $G\beta$ are
\begin{align*}
(G\beta)_1=t+\rho t=(1+\rho)t,
\qquad
(G\beta)_2=\rho t+t=(1+\rho)t.
\end{align*}
Thus
\begin{align*}
\frac{X^\top(Y-X\beta)}{n}
=
\frac{X^\top Y}{n}-G\beta
=
\bigl(a-(1+\rho)t,\ a-(1+\rho)t\bigr).
\end{align*}
The Dantzig constraint is therefore
\begin{align*}
|a-(1+\rho)t|\le \lambda,
\end{align*}
which is equivalent to
\begin{align*}
a-\lambda \le (1+\rho)t \le a+\lambda.
\end{align*}
Since $1+\rho>0$,
\begin{align*}
\frac{a-\lambda}{1+\rho}\le t\le \frac{a+\lambda}{1+\rho}.
\end{align*}
Because $a>\lambda$, this interval is positive, and the diagonal objective is
\begin{align*}
\|\beta\|_1=|t|+|t|=2t.
\end{align*}
The smallest diagonal feasible point is therefore
\begin{align*}
t=\frac{a-\lambda}{1+\rho}.
\end{align*}
The Lasso agrees in this symmetric two-coordinate case. Ignoring the constant $Y^\top Y/(2n)$, its objective is
\begin{align*}
\frac{1}{2}\beta^\top G\beta-\left(\frac{X^\top Y}{n}\right)^\top\beta+\lambda\|\beta\|_1.
\end{align*}
On the diagonal $\beta=(t,t)$ with $t>0$,
\begin{align*}
G\beta=\bigl((1+\rho)t,\ (1+\rho)t\bigr),
\end{align*}
so
\begin{align*}
\frac{1}{2}\beta^\top G\beta
=
\frac{1}{2}\bigl(t(1+\rho)t+t(1+\rho)t\bigr)
=
(1+\rho)t^2.
\end{align*}
Also,
\begin{align*}
-\left(\frac{X^\top Y}{n}\right)^\top\beta+\lambda\|\beta\|_1
=
-(a t+a t)+2\lambda t
=
-2at+2\lambda t.
\end{align*}
Thus the diagonal objective is
\begin{align*}
(1+\rho)t^2-2at+2\lambda t.
\end{align*}
Its derivative is
\begin{align*}
2(1+\rho)t-2a+2\lambda,
\end{align*}
so the positive critical point satisfies
\begin{align*}
(1+\rho)t=a-\lambda.
\end{align*}
Hence the Lasso diagonal solution has
\begin{align*}
t=\frac{a-\lambda}{1+\rho},
\end{align*}
the same value as the Dantzig selector. This agreement is a useful baseline: correlation alone does not force the methods apart when the score vector has the same symmetry as the Gram matrix.
For a concrete separation, take three coordinates and set $G=X^\top X/n$ by
\begin{align*}
G_{11}=G_{22}=G_{33}=1,
\qquad
G_{12}=G_{21}=G_{13}=G_{31}=\frac{1}{2},
\qquad
G_{23}=G_{32}=0,
\end{align*}
with
\begin{align*}
\frac{X^\top Y}{n}=(4,4,3),
\qquad
\lambda=1.
\end{align*}
For $\beta_{\mathrm{DS}}=(4,1,0)$, the coordinates of $G\beta_{\mathrm{DS}}$ are
\begin{align*}
(G\beta_{\mathrm{DS}})_1=4+\frac{1}{2}+0=\frac{9}{2},
\qquad
(G\beta_{\mathrm{DS}})_2=2+1+0=3,
\qquad
(G\beta_{\mathrm{DS}})_3=2+0+0=2.
\end{align*}
Therefore
\begin{align*}
\frac{X^\top Y}{n}-G\beta_{\mathrm{DS}}
=
\left(4-\frac{9}{2},\ 4-3,\ 3-2\right)
=
\left(-\frac{1}{2},\ 1,\ 1\right).
\end{align*}
Its $\ell_\infty$ norm is $1$, so $\beta_{\mathrm{DS}}$ is feasible for the Dantzig constraint at radius $\lambda=1$.
Now let $\beta=(\beta_1,\beta_2,\beta_3)$ be any feasible point. The second score coordinate is
\begin{align*}
4-\left(\frac{1}{2}\beta_1+\beta_2\right),
\end{align*}
so feasibility gives
\begin{align*}
\left|4-\left(\frac{1}{2}\beta_1+\beta_2\right)\right|\le 1.
\end{align*}
Hence
\begin{align*}
-1\le 4-\left(\frac{1}{2}\beta_1+\beta_2\right)\le 1,
\end{align*}
and therefore
\begin{align*}
\frac{1}{2}\beta_1+\beta_2\ge 3.
\end{align*}
Similarly, the third score coordinate is
\begin{align*}
3-\left(\frac{1}{2}\beta_1+\beta_3\right),
\end{align*}
and feasibility gives
\begin{align*}
\left|3-\left(\frac{1}{2}\beta_1+\beta_3\right)\right|\le 1.
\end{align*}
Thus
\begin{align*}
\frac{1}{2}\beta_1+\beta_3\ge 2.
\end{align*}
Adding the two lower bounds yields
\begin{align*}
\left(\frac{1}{2}\beta_1+\beta_2\right)+\left(\frac{1}{2}\beta_1+\beta_3\right)\ge 3+2,
\end{align*}
so
\begin{align*}
\beta_1+\beta_2+\beta_3\ge 5.
\end{align*}
Since $|x|\ge x$ for every real $x$,
\begin{align*}
\|\beta\|_1
=
|\beta_1|+|\beta_2|+|\beta_3|
\ge
\beta_1+\beta_2+\beta_3
\ge 5.
\end{align*}
The displayed feasible point has
\begin{align*}
\|\beta_{\mathrm{DS}}\|_1=|4|+|1|+|0|=5,
\end{align*}
so it is a Dantzig solution.
The Lasso with the same tuning parameter selects a different point,
\begin{align*}
\beta_{\mathrm{L}}=\left(1,\frac{5}{2},\frac{3}{2}\right).
\end{align*}
For this vector, the coordinates of $G\beta_{\mathrm{L}}$ are
\begin{align*}
(G\beta_{\mathrm{L}})_1=1+\frac{5}{4}+\frac{3}{4}=3,
\qquad
(G\beta_{\mathrm{L}})_2=\frac{1}{2}+\frac{5}{2}+0=3,
\qquad
(G\beta_{\mathrm{L}})_3=\frac{1}{2}+0+\frac{3}{2}=2.
\end{align*}
Therefore
\begin{align*}
\frac{X^\top Y}{n}-G\beta_{\mathrm{L}}
=
(4-3,\ 4-3,\ 3-2)
=
(1,1,1).
\end{align*}
The smooth part of the Lasso objective has gradient
\begin{align*}
G\beta-\frac{X^\top Y}{n}.
\end{align*}
At $\beta_{\mathrm{L}}$ this gradient is
\begin{align*}
G\beta_{\mathrm{L}}-\frac{X^\top Y}{n}=(-1,-1,-1).
\end{align*}
All three coordinates of $\beta_{\mathrm{L}}$ are positive, hence
\begin{align*}
\partial\|\beta_{\mathrm{L}}\|_1=\{(1,1,1)\}.
\end{align*}
With $\lambda=1$,
\begin{align*}
\left(G\beta_{\mathrm{L}}-\frac{X^\top Y}{n}\right)+\lambda(1,1,1)
=
(-1,-1,-1)+(1,1,1)
=
(0,0,0).
\end{align*}
This is the Lasso first-order optimality condition, and the objective is convex, so $\beta_{\mathrm{L}}$ is a Lasso solution. Since
\begin{align*}
\beta_{\mathrm{DS}}=(4,1,0)
\qquad\text{and}\qquad
\beta_{\mathrm{L}}=\left(1,\frac{5}{2},\frac{3}{2}\right),
\end{align*}
the two estimators are not the same. The Dantzig selector chooses a smallest $\ell_1$ point in a score polyhedron, while the Lasso chooses the point where the quadratic loss gradient balances the $\ell_1$ subgradient.
[/example]
The comparison theorem is therefore a statement about rates, not equality of estimators. Under compatible restricted eigenvalue conditions and matched tuning, both attain the same sparsity-scaled order.
[quotetheorem:5577]
[citeproof:5577]
The score event cannot be dropped: if
\begin{align*}
\frac{X^\top\varepsilon}{n}
\end{align*}
is much larger than $\lambda$ in $\ell_\infty$ norm, the stochastic term in the [Lasso basic inequality](/theorems/5555) and the Dantzig feasibility comparison may both dominate the curvature term. The shared restricted eigenvalue lower bound is also needed, since correlated or duplicated columns can produce cone directions whose prediction norm is zero. The theorem does not claim that the estimators select the same variables or have identical finite-sample constants; it only compares the order of their prediction and $\ell_1$ errors under matched assumptions.
This result is often the practical takeaway: under the assumptions used in this course, the Dantzig selector does not improve the main order of the Lasso rates from Chapter 4, but it gives a different convex formulation. The original Candes--Tao viewpoint came from compressed sensing, where the unknown vector is sparse and the measurements are insufficient to identify all coordinates without exploiting that sparsity. In the noiseless version, basis pursuit searches for the smallest $\ell_1$ vector among all vectors matching the measurements. With noise, exact matching is too rigid, so the feasible set is widened into a tube around the measurement equations. The Dantzig selector transports this geometry to regression by replacing a residual tube with a score tube: instead of requiring the residual itself to be small in Euclidean norm, it requires every residual correlation with a design column to be small.
The constraint
\begin{align*}
\left\|\frac{X^\top(Y-X\beta)}{n}\right\|_\infty \le \lambda
\end{align*}
is therefore a confidence-region interpretation of the normal equations: it keeps all empirical score coordinates inside the range expected from the noise. The objective then chooses the smallest $\ell_1$ point inside this score-confidence polyhedron. Linear programming duality is not only computational language here: the primal $\ell_1$ objective and the dual $\ell_\infty$ score bound are the paired norms that make sparse recovery and high-dimensional regression share the same geometry.
## Choosing and Interpreting the Constraint Radius
The final question is how to read the tuning parameter $\lambda$. In both Lasso and Dantzig analyses, $\lambda$ must dominate the random score
\begin{align*}
\frac{X^\top\varepsilon}{n}.
\end{align*}
If it is too small, $\beta^*$ may be infeasible; if it is too large, the estimator may be overly conservative.
[definition: Score Event]
For $\lambda>0$, the score event at radius $\lambda$ is
\begin{align*}
\mathcal E_\lambda = \left\{\left\|\frac{X^\top\varepsilon}{n}\right\|_\infty \le \lambda\right\}.
\end{align*}
[/definition]
This event separates probability from optimisation. The probabilistic part chooses $\lambda$ so that $\mathcal E_\lambda$ has high probability; the deterministic part proves oracle inequalities on $\mathcal E_\lambda$.
[example: Rate Substitution]
Fix $\delta\in(0,1)$ and choose $A$ large enough that
\begin{align*}
A\sigma\sqrt{\frac{\log p}{n}}
\ge
\sigma\sqrt{\frac{2\log(2p/\delta)}{n}}.
\end{align*}
By the Gaussian score calibration above, this choice gives
\begin{align*}
\left\|\frac{X^\top\varepsilon}{n}\right\|_\infty
\le
A\sigma\sqrt{\frac{\log p}{n}}
\end{align*}
with probability at least $1-\delta$. On this event, set
\begin{align*}
\lambda=A\sigma\sqrt{\frac{\log p}{n}}.
\end{align*}
The *Dantzig Oracle Inequality under a Restricted Eigenvalue* gives
\begin{align*}
\frac{\|X(\hat\beta_{\mathrm{DS}}-\beta^*)\|_2^2}{n}
\le
\frac{16\lambda^2s}{\kappa(S,1)^2}.
\end{align*}
Substituting the displayed value of $\lambda$,
\begin{align*}
\lambda^2
=
\left(A\sigma\sqrt{\frac{\log p}{n}}\right)^2
=
A^2\sigma^2\frac{\log p}{n}.
\end{align*}
Therefore
\begin{align*}
\frac{16\lambda^2s}{\kappa(S,1)^2}
=
\frac{16A^2\sigma^2s\log p}{n\kappa(S,1)^2},
\end{align*}
so, absorbing the numerical factor $16A^2$ into the universal comparison constant,
\begin{align*}
\frac{\|X(\hat\beta_{\mathrm{DS}}-\beta^*)\|_2^2}{n}
\lesssim
\frac{\sigma^2s\log p}{n\kappa(S,1)^2}.
\end{align*}
The same oracle inequality gives
\begin{align*}
\|\hat\beta_{\mathrm{DS}}-\beta^*\|_1
\le
\frac{8\lambda s}{\kappa(S,1)^2}.
\end{align*}
Substituting $\lambda=A\sigma\sqrt{\log p/n}$ yields
\begin{align*}
\frac{8\lambda s}{\kappa(S,1)^2}
=
\frac{8A\sigma s\sqrt{\log p/n}}{\kappa(S,1)^2},
\end{align*}
and hence
\begin{align*}
\|\hat\beta_{\mathrm{DS}}-\beta^*\|_1
\lesssim
\frac{\sigma s\sqrt{\log p/n}}{\kappa(S,1)^2}.
\end{align*}
Thus the deterministic Dantzig bounds become high-probability oracle rates once $\lambda$ is chosen to dominate the Gaussian score.
[/example]
The chapter closes the constrained-estimation viewpoint developed here. The Dantzig selector shows that sparse recovery can be formulated as approximate satisfaction of the normal equations plus $\ell_1$ minimisation. Chapter 8 next modifies the penalty to reduce shrinkage bias through nonconvex regularisation, and Chapter 9 then prepares the estimator for inference through debiasing.
The Dantzig selector confirms that sparse estimation can be framed either as penalised optimisation or as constrained moment matching, but it does not remove the shrinkage bias inherent in convex ℓ1 methods. The next chapter therefore changes direction and studies nonconvex penalties that aim to preserve sparsity while reducing bias for large signals.
# 8. Nonconvex Regularisation: SCAD and MCP
The Lasso and the Dantzig selector give convex and computationally stable ways to estimate a sparse regression vector, but the Lasso's convex $\ell_1$ penalty comes with shrinkage: every selected coefficient pays the same marginal penalty, no matter how large its signal is. This chapter asks whether we can keep the sparsity-inducing behaviour near zero while reducing the bias on large coefficients. The answer is given by folded-concave penalties, especially SCAD and MCP, whose penalty derivatives decrease to zero after the coefficient is large enough.
The price is nonconvexity. We therefore separate two questions that convex regularisation allowed us to merge: what sparse local optima look like statistically, and whether an algorithm can find a desirable one. The chapter develops the thresholding rules, the oracle-property statements, and the local curvature conditions that make this separation precise.
## Why Folded-Concave Penalties Reduce Shrinkage Bias
The problem with the Lasso is already visible in the one-dimensional least-squares problem. If $z$ is the unregularised least-squares coefficient, the Lasso estimate minimises
\begin{align*}
\frac{1}{2}(u-z)^2+\lambda |u|
\end{align*}
over $u\in\mathbb R$ and equals the soft-thresholding rule, so every nonzero coefficient is shifted toward zero by exactly $\lambda$. A folded-concave penalty keeps the large derivative at the origin that creates sparsity, but makes the derivative smaller for large $|u|$ so that strong signals are not repeatedly penalised.
[definition: Folded-Concave Penalty]
A penalty $p_\lambda:[0,\infty)\to[0,\infty)$ is a folded-concave penalty with tuning parameter $\lambda>0$ if $p_\lambda(0)=0$, $p_\lambda$ is increasing on $[0,\infty)$, $p_\lambda$ is concave on $[0,\infty)$, and $p_\lambda$ has a finite right derivative $p_\lambda'(0+)$ at $0$ and a one-sided derivative $p_\lambda'(t)$ at every $t>0$.
[/definition]
In a separable penalised regression criterion, this penalty is applied to the magnitude of each coefficient through terms $p_\lambda(|\beta_j|)$. The derivative convention in the definition is the one used in the stationarity equations below: at zero we use the right derivative, while at kink points away from zero the relevant one-sided derivative or subderivative interval gives the same coordinate thresholding inequalities. Concavity means that the marginal cost $p_\lambda'(t)$ decreases as $t$ grows. The singular behaviour near zero mimics the Lasso's ability to set coefficients exactly to zero, while the flattening of the derivative controls the bias for large coordinates. To turn this design principle into an estimator, we need a concrete derivative shape; SCAD is the first canonical choice because it keeps a Lasso plateau near zero and then clips the marginal penalty smoothly.
[definition: SCAD Penalty Derivative]
For $\lambda>0$ and $a>2$, the smoothly clipped absolute deviation penalty derivative is the map $p'_{\lambda,a}:[0,\infty)\to[0,\infty)$ specified by
\begin{align*}
p'_{\lambda,a}(t)
&= \lambda \mathbb{1}_{0\le t\le \lambda}
+ \frac{(a\lambda-t)_+}{a-1}\mathbb{1}_{t>\lambda},
\qquad t\ge 0,
\end{align*}
with $p_{\lambda,a}(0)=0$.
[/definition]
Thus SCAD has Lasso-like slope $\lambda$ near zero, a linearly decreasing slope for $\lambda<t<a\lambda$, and no slope once $t\ge a\lambda$. The parameter $a$ controls how long the transition lasts; the common default in applications is approximately $a=3.7$. It is useful to compare SCAD with an even simpler folded-concave derivative, because the strength of the concavity will later determine how much design curvature is needed for stable local minima.
[definition: MCP Penalty Derivative]
For $\lambda>0$ and $\gamma>1$, the minimax concave penalty derivative is the map $p'_{\lambda,\gamma}:[0,\infty)\to[0,\infty)$ specified by
\begin{align*}
p'_{\lambda,\gamma}(t)
&= \left(\lambda-\frac{t}{\gamma}\right)_+,
\qquad t\ge 0,
\end{align*}
with $p_{\lambda,\gamma}(0)=0$.
[/definition]
MCP decreases its derivative immediately from $\lambda$ and becomes flat once $t\ge \gamma\lambda$. Compared with SCAD, the transition is simpler and the maximum concavity is $1/\gamma$, a number that will reappear in local convexity conditions. Before moving to multivariate regression, we test both penalties in the scalar problem where the entire effect of shrinkage can be read from the thresholding map.
[example: One-Dimensional Thresholding Comparison]
Consider
\begin{align*}
Q(u;z)=\frac{1}{2}(u-z)^2+p_\lambda(|u|).
\end{align*}
It is enough to compute the minimiser for $z\ge0$ and $u\ge0$, because $Q(u;z)=Q(-u;-z)$; the general rule is obtained by multiplying the nonnegative solution for $|z|$ by $\operatorname{sgn}(z)$.
For the Lasso penalty $p_\lambda(t)=\lambda t$, the derivative for $u>0$ is
\begin{align*}
\frac{d}{du}\left\{\frac{1}{2}(u-z)^2+\lambda u\right\}=u-z+\lambda.
\end{align*}
The stationary equation $u-z+\lambda=0$ gives $u=z-\lambda$, which is positive exactly when $z>\lambda$. At $u=0$, the right derivative is $-z+\lambda$, so the minimiser is $0$ when $z\le\lambda$. Hence, for $z\ge0$,
\begin{align*}
\hat u_{\mathrm{Lasso}}=(z-\lambda)_+.
\end{align*}
Restoring the sign gives
\begin{align*}
\hat u_{\mathrm{Lasso}}=\operatorname{sgn}(z)(|z|-\lambda)_+.
\end{align*}
When $|z|>\lambda$, this means
\begin{align*}
z-\hat u_{\mathrm{Lasso}}=\lambda\operatorname{sgn}(z).
\end{align*}
Thus the absolute shrinkage bias is always $\lambda$ for every selected scalar coefficient.
For SCAD with $a>2$, again work with $z\ge0$. At $u=0$, the right derivative is $-z+\lambda$, so the update is $0$ for $0\le z\le\lambda$. On $0<u\le\lambda$, the SCAD derivative equals $\lambda$, so
\begin{align*}
Q'(u;z)=u-z+\lambda.
\end{align*}
Solving $Q'(u;z)=0$ gives $u=z-\lambda$, and the condition $0<z-\lambda\le\lambda$ is equivalent to $\lambda<z\le2\lambda$. Thus the SCAD update equals $z-\lambda$ on that interval. On $\lambda<u\le a\lambda$, the derivative is
\begin{align*}
Q'(u;z)=u-z+\frac{a\lambda-u}{a-1}.
\end{align*}
Solving the stationary equation gives
\begin{align*}
u-z+\frac{a\lambda-u}{a-1}=0.
\end{align*}
Multiplying by $a-1$ gives
\begin{align*}
(a-1)u-(a-1)z+a\lambda-u=0.
\end{align*}
Collecting the $u$ terms gives
\begin{align*}
(a-2)u=(a-1)z-a\lambda.
\end{align*}
Since $a>2$, division by $a-2$ gives
\begin{align*}
u=\frac{(a-1)z-a\lambda}{a-2}.
\end{align*}
This candidate is larger than $\lambda$ exactly when $z>2\lambda$, and it is at most $a\lambda$ exactly when $z\le a\lambda$, so the middle SCAD branch applies for $2\lambda<z\le a\lambda$. On $u>a\lambda$, the SCAD derivative is $0$, so
\begin{align*}
Q'(u;z)=u-z.
\end{align*}
The stationary point is $u=z$, and it lies in this region exactly when $z>a\lambda$. Therefore, for general $z$, SCAD returns $0$ when $|z|\le\lambda$, returns $\operatorname{sgn}(z)(|z|-\lambda)$ when $\lambda<|z|\le2\lambda$, returns
\begin{align*}
\operatorname{sgn}(z)\frac{(a-1)|z|-a\lambda}{a-2}
\end{align*}
when $2\lambda<|z|\le a\lambda$, and returns $z$ when $|z|>a\lambda$.
For MCP with $\gamma>1$, the right derivative at $0$ is again $-z+\lambda$, so the update is $0$ for $0\le z\le\lambda$. On $0<u<\gamma\lambda$, the derivative is
\begin{align*}
Q'(u;z)=u-z+\lambda-\frac{u}{\gamma}.
\end{align*}
The stationary equation is
\begin{align*}
u-z+\lambda-\frac{u}{\gamma}=0.
\end{align*}
Collecting the $u$ terms gives
\begin{align*}
\left(1-\frac{1}{\gamma}\right)u=z-\lambda.
\end{align*}
Since $\gamma>1$, the coefficient $1-1/\gamma$ is positive, so
\begin{align*}
u=\frac{z-\lambda}{1-1/\gamma}.
\end{align*}
This candidate is positive exactly when $z>\lambda$. It lies below or at $\gamma\lambda$ precisely when
\begin{align*}
\frac{z-\lambda}{1-1/\gamma}\le\gamma\lambda.
\end{align*}
Multiplying by the positive number $1-1/\gamma=(\gamma-1)/\gamma$ gives
\begin{align*}
z-\lambda\le(\gamma-1)\lambda.
\end{align*}
Equivalently, $z\le\gamma\lambda$. On $u>\gamma\lambda$, the MCP derivative is $0$, so $Q'(u;z)=u-z$ and the stationary point is $u=z$, admissible exactly when $z>\gamma\lambda$. Restoring the sign, MCP returns $0$ when $|z|\le\lambda$, returns
\begin{align*}
\operatorname{sgn}(z)\frac{|z|-\lambda}{1-1/\gamma}
\end{align*}
when $\lambda<|z|\le\gamma\lambda$, and returns $z$ when $|z|>\gamma\lambda$.
The comparison isolates the bias pattern: Lasso subtracts $\lambda$ from every selected coordinate, while SCAD and MCP become exactly unshrunk once $|z|$ passes their flat-penalty thresholds.
[/example]
This comparison isolates the main statistical motivation. Lasso shrinkage is permanent, whereas SCAD and MCP are designed so that sufficiently large empirical signals are left unshrunk. This motivates the following theorem, which records the exact SCAD thresholding map needed for coordinate descent and for the later oracle-property argument.
[quotetheorem:5578]
[citeproof:5578]
The thresholding formula shows that SCAD behaves like a variable-strength shrinkage rule: it selects by thresholding near zero, then gradually turns shrinkage off. The condition $a>2$ is not cosmetic: the middle formula divides by $a-2$, and the scalar objective needs positive effective curvature on the transition interval to make that branch a well-ordered thresholding rule. If $a\le 2$, the transition formula either is not defined or reverses the curvature comparison that orders the middle branch. The normalisation of the scalar quadratic is also part of the statement. If the coordinate loss is $\rho(u-z)^2/2$ with $\rho\ne1$, the first-order equation becomes
\begin{align*}
\rho(u-z)+p'_{\lambda,a}(u)=0.
\end{align*}
Thus the breakpoints and shrinkage factors must be rescaled by $\rho$ rather than copied from the theorem. At the breakpoints $\lambda$, $2\lambda$, and $a\lambda$, the rule is continuous but the derivative changes from one piece of the penalty to another, so numerical implementations should treat boundary comparisons carefully rather than relying on a single smooth first-order equation. The theorem is scalar and separable; in a correlated multivariate design, a coordinate update uses this map only after the other coordinates have been frozen, and the resulting coordinate-wise minimiser need not be a global minimiser of the full nonconvex objective.
[example: Bias Reduction for Large Coefficients]
In the orthonormal design model $X^\top X/n=I_p$, write $z_j=X_j^\top Y/n$. For a single coordinate $u=\beta_j$, with all other coordinates held fixed, let $r_j=Y-\sum_{k\ne j}X_k\beta_k$. The squared-loss part is
\begin{align*}
\frac{1}{2n}|r_j-X_j u|^2=\frac{1}{2n}\left(|r_j|^2-2uX_j^\top r_j+u^2X_j^\top X_j\right).
\end{align*}
Since $X_j^\top X_j/n=1$, this becomes
\begin{align*}
\frac{1}{2n}|r_j-X_j u|^2=\frac{|r_j|^2}{2n}-u\frac{X_j^\top r_j}{n}+\frac{u^2}{2}.
\end{align*}
In the orthonormal separated problem, $X_j^\top r_j/n=z_j$, so up to constants independent of $u$ the coordinate objective is
\begin{align*}
\frac{1}{2}u^2-u z_j+p_\lambda(|u|).
\end{align*}
Completing the square gives
\begin{align*}
\frac{1}{2}u^2-u z_j=\frac{1}{2}(u-z_j)^2-\frac{1}{2}z_j^2.
\end{align*}
Thus the coordinate update minimises the scalar criterion
\begin{align*}
\frac{1}{2}(u-z_j)^2+p_\lambda(|u|).
\end{align*}
Now suppose $z_j=5\lambda$ and SCAD uses $a=3.7$. Since $\lambda>0$, we have $|z_j|=5\lambda$ and $a\lambda=3.7\lambda$, hence
\begin{align*}
|z_j|=5\lambda>3.7\lambda=a\lambda.
\end{align*}
On the flat SCAD region $|u|>a\lambda$, the penalty derivative is $0$, so for the positive coordinate branch the first-order equation is
\begin{align*}
0=u-z_j+0.
\end{align*}
Substituting $z_j=5\lambda$ gives
\begin{align*}
0=u-5\lambda.
\end{align*}
Therefore $u=5\lambda=z_j$, and this candidate is admissible because $5\lambda>3.7\lambda$. The SCAD coordinate update returns $z_j$ with no shrinkage.
For the Lasso penalty $p_\lambda(t)=\lambda t$, the positive-coordinate first-order equation is
\begin{align*}
0=u-z_j+\lambda.
\end{align*}
Substituting $z_j=5\lambda$ gives
\begin{align*}
0=u-5\lambda+\lambda=u-4\lambda.
\end{align*}
Thus the Lasso update is $u=4\lambda=z_j-\lambda$. In the noiseless orthonormal calculation, where the unpenalised scalar coefficient is $z_j$, the absolute shrinkage is
\begin{align*}
|z_j-u|=|5\lambda-4\lambda|=\lambda.
\end{align*}
This is the basic sense in which folded-concave penalties target the Lasso's large-signal bias: once the signal enters the flat part of SCAD, the coordinate update is no longer pulled toward zero.
[/example]
## Local Minima and Coordinate Descent
The next issue is algorithmic rather than statistical: nonconvex objectives can have many stationary points. The course's strategy is not to claim that every local minimum is good, but to identify a sparse region where the loss has enough curvature to overcome the concavity of the penalty. Within that region, stationary points obey useful inequalities and coordinate descent has a meaningful target.
[definition: Nonconvex Penalised Least-Squares Criterion]
For fixed design $X\in\mathbb R^{n\times p}$, response $Y\in\mathbb R^n$, and penalty $p_\lambda$, define the criterion $Q:\mathbb R^p\to\mathbb R$ by
\begin{align*}
Q(\beta)
&=\frac{1}{2n}|Y-X\beta|^2+
\sum_{j=1}^p p_\lambda(|\beta_j|),
\qquad \beta\in\mathbb R^p.
\end{align*}
[/definition]
A point can be locally optimal even when the full objective is not convex. To identify which local optima are compatible with the data and penalty, we need the stationarity equations: they replace the Lasso subgradient conditions and expose how the penalty derivative changes across active coordinates.
[definition: Stationary Point for a Folded-Concave Criterion]
A vector $\hat\beta\in\mathbb R^p$ is a stationary point of $Q$ if, for every $j\in\{1,\dots,p\}$,
\begin{align*}
\frac{1}{n}X_j^\top(Y-X\hat\beta)
&=p_\lambda'(|\hat\beta_j|)\operatorname{sgn}(\hat\beta_j)
\quad\text{when }\hat\beta_j\ne0,
\end{align*}
and
\begin{align*}
\left|\frac{1}{n}X_j^\top(Y-X\hat\beta)\right|&\le p_\lambda'(0+)
\quad\text{when }\hat\beta_j=0.
\end{align*}
[/definition]
These are the KKT-style equations for a nonconvex separable penalty. They look similar to the Lasso equations, but the active-coordinate shrinkage is now weighted by $p_\lambda'(|\hat\beta_j|)$ rather than the constant $\lambda$. The same one-coordinate structure suggests a practical algorithm: update a single coefficient by solving the scalar penalised problem while holding all other coordinates fixed.
[definition: Coordinate Descent Update]
A coordinate descent step for $Q$ chooses an index $j$, holds $\beta_k$ fixed for $k\ne j$, and replaces $\beta_j$ by a minimiser of the resulting one-dimensional function
\begin{align*}
\varphi_j:\mathbb R&\to\mathbb R
\end{align*}
given by
\begin{align*}
\varphi_j(u)
&= \frac{1}{2n}|r_j-X_j u|^2+p_\lambda(|u|),
\end{align*}
where $r_j=Y-\sum_{k\ne j}X_k\beta_k$.
[/definition]
When the design columns are normalised by $|X_j|^2/n=1$, this coordinate update has the same thresholding form as the one-dimensional rules, with $z_j=X_j^\top r_j/n$. Coordinate descent is therefore computationally attractive: each step is cheap and uses the same shrinkage rule that explains the statistical bias reduction. The remaining issue is when the nonconvex criterion is locally well behaved around sparse iterates, which is governed by restricted curvature minus penalty concavity.
[quotetheorem:5579]
[citeproof:5579]
This theorem is local because the curvature hypothesis is imposed only on sparse directions, not on all of $\mathbb R^p$. That is exactly the high-dimensional situation: if $p>n$, the quadratic loss cannot be globally strongly convex, but it can still be strongly convex on low-dimensional sparse subspaces. The inequality $\kappa>\zeta$ is necessary for this argument: if the penalty curvature loss matches or exceeds the design curvature, the net second-order lower bound can be zero or negative, and two separated stationary points may appear on the same sparse coordinate line. Sparsity is also essential; when the union support is larger than $m$, the design may contain nearly null directions, so the squared loss gives too little curvature to offset the folded-concave penalty. Thus the theorem does not control dense local minima or iterates that wander outside the sparse region.
[remark: Statistical and Computational Tradeoff]
Folded-concave penalties improve the statistical target by reducing shrinkage bias, but they make the optimisation landscape more complicated. Stronger concavity, such as smaller $\gamma$ for MCP, gives faster bias reduction but demands stronger design curvature for local convexity. The tuning of the penalty therefore balances estimation bias, variable-selection sharpness, and algorithmic stability.
[/remark]
## Oracle Properties and Sparse Local Optima
The final question is whether a nonconvex regularised estimator can behave as if the true support were known in advance. This is the oracle benchmark: recover the correct active set and estimate the nonzero coefficients at the same first-order rate as least squares on that active set. Folded-concave penalties can achieve this under beta-min and regularity assumptions because large true coefficients fall into the flat part of the penalty.
[definition: Oracle Estimator]
Let $S=\{j:\beta_j^*\ne0\}$ with $|S|=s$. The oracle least-squares estimator is the vector $\hat\beta^{\mathrm{or}}\in\mathbb R^p$ satisfying $\hat\beta^{\mathrm{or}}_{S^c}=0$ and
\begin{align*}
\hat\beta^{\mathrm{or}}_S
&=\arg\min_{b\in\mathbb R^s}\frac{1}{2n}|Y-X_S b|^2.
\end{align*}
[/definition]
The oracle estimator is not available in practice because it uses the unknown support $S$. It is nevertheless the right comparison object for variable-selection methods: if an estimator equals the oracle estimator with high probability, it has recovered the support and avoided unnecessary shrinkage on the active coordinates. Such a statement cannot hold when true coefficients are too close to zero, so the next assumption records the minimum signal strength needed to distinguish active variables from noise.
[definition: Beta-Min Condition]
For a threshold $b_n>0$, the vector $\beta^*\in\mathbb R^p$ satisfies the beta-min condition at level $b_n$ if
\begin{align*}
\min_{j\in S}|\beta_j^*|\ge b_n,
\qquad S=\{j:\beta_j^*\ne0\}.
\end{align*}
[/definition]
The beta-min condition is the formal way to say that nonzero signals are separated from noise. For SCAD, the natural scale is $a\lambda$: active coefficients should be large enough that the penalty derivative vanishes near the oracle estimate. This is the mechanism behind the oracle theorem: once the active coordinates lie in the flat region and the inactive score is below the threshold, the oracle estimator satisfies the nonconvex stationarity equations.
[quotetheorem:5580]
[citeproof:5580]
This result is not a promise that arbitrary optimisation initialisations find the oracle solution. It says that the statistical objective contains a desirable sparse local optimum, and that this optimum has the same first-order behaviour as the estimator that knew the true support beforehand. The noise-score assumption is needed because an inactive coordinate with $|X_j^\top\varepsilon/n|$ above the threshold can satisfy the entry condition even when $\beta_j^*=0$. The inactive-active correlation condition prevents the fitted active residual from creating a large score on $S^c$; for example, if some $X_j$ with $j\in S^c$ is nearly collinear with a column in $X_S$, then the inactive score can remain above $\lambda$ after fitting the oracle model, so that variable can enter as a surrogate even when the active coefficients satisfy beta-min. The beta-min condition keeps the oracle active coefficients inside the flat part of SCAD after estimation error is accounted for, while restricted convexity rules out nearby sparse competitors. If any of these ingredients fails, the oracle estimator may still be a useful comparison object, but it need not be a local minimiser of the SCAD criterion.
[example: Oracle Behaviour in an Orthonormal Design]
Assume $X^\top X/n=I_p$ and $Y=X\beta^*+\varepsilon$. For each coordinate $j$,
\begin{align*}
z_j=\frac{1}{n}X_j^\top Y
\end{align*}
and substituting the model gives
\begin{align*}
z_j=\frac{1}{n}X_j^\top(X\beta^*+\varepsilon).
\end{align*}
By linearity,
\begin{align*}
z_j=\sum_{k=1}^p \frac{1}{n}X_j^\top X_k\beta_k^*+\frac{1}{n}X_j^\top\varepsilon.
\end{align*}
Since $X^\top X/n=I_p$, we have $X_j^\top X_k/n=1$ when $j=k$ and $X_j^\top X_k/n=0$ when $j\ne k$, so
\begin{align*}
z_j=\beta_j^*+\frac{1}{n}X_j^\top\varepsilon.
\end{align*}
Let $S=\{j:\beta_j^*\ne0\}$ and suppose
\begin{align*}
\max_{1\le j\le p}\left|\frac{1}{n}X_j^\top\varepsilon\right|\le\lambda
\end{align*}
and
\begin{align*}
\min_{j\in S}|\beta_j^*|>(a+1)\lambda.
\end{align*}
If $j\in S^c$, then $\beta_j^*=0$, hence
\begin{align*}
|z_j|=\left|\frac{1}{n}X_j^\top\varepsilon\right|.
\end{align*}
The score bound gives
\begin{align*}
|z_j|\le\lambda.
\end{align*}
By the *[SCAD Thresholding Rule](/theorems/5578)*, the SCAD coordinate update is therefore $\hat\beta_j=0$ for every $j\in S^c$.
If $j\in S$, then the reverse triangle inequality gives
\begin{align*}
|z_j|\ge |\beta_j^*|-\left|\frac{1}{n}X_j^\top\varepsilon\right|.
\end{align*}
Using the beta-min bound and the score bound,
\begin{align*}
|z_j|>(a+1)\lambda-\lambda.
\end{align*}
Therefore
\begin{align*}
|z_j|>a\lambda.
\end{align*}
By the flat branch of the *SCAD Thresholding Rule*, the SCAD coordinate update is $\hat\beta_j=z_j$ for every $j\in S$.
In the same orthonormal design, least squares on the true support satisfies
\begin{align*}
\hat\beta^{\mathrm{or}}_S=\left(\frac{1}{n}X_S^\top X_S\right)^{-1}\frac{1}{n}X_S^\top Y.
\end{align*}
Since $X^\top X/n=I_p$, we have $X_S^\top X_S/n=I_s$, so
\begin{align*}
\hat\beta^{\mathrm{or}}_S=I_s^{-1}\frac{1}{n}X_S^\top Y.
\end{align*}
Because $I_s^{-1}=I_s$,
\begin{align*}
\hat\beta^{\mathrm{or}}_S=\frac{1}{n}X_S^\top Y=z_S.
\end{align*}
Also $\hat\beta^{\mathrm{or}}_{S^c}=0$ by definition of the oracle estimator. Hence the SCAD update equals the oracle least-squares estimator: it returns $z_j$ on $S$ and $0$ on $S^c$. The example shows that, in the simplest geometry, score control removes inactive variables while beta-min places active variables beyond SCAD's flat-penalty threshold.
[/example]
The orthonormal example makes the oracle phenomenon look like direct thresholding, while the general theorem needs design assumptions to control correlations between active and inactive variables. This distinction motivates a cautious reading of oracle claims in applied work.
[remark: What the Oracle Property Does Not Say]
The oracle property is an asymptotic or high-probability local statement under strong signal and design assumptions. It does not remove the need to choose $\lambda$, check sensitivity to initialisation, or report uncertainty about model selection. In finite samples, a convex method with a larger bias may be preferable when optimisation stability is the main constraint.
[/remark]
The conceptual endpoint of the chapter is a three-way comparison. Lasso is convex and stable but biased on large coefficients. SCAD and MCP reduce large-signal bias and can have oracle sparse local optima, but their nonconvexity shifts attention to local curvature, initialisation, and coordinate-wise algorithms. Chapter 9 uses this distinction again when discussing debiasing and post-selection inference: regularisation can select a useful model, but estimation and inference often require a second look at the shrinkage it introduced.
Nonconvex regularisation shows that bias reduction comes at the price of more delicate optimisation and local analysis, so estimation theory is no longer enough on its own. The next chapter uses this distinction to move from point estimation to inference, where debiasing is needed to recover asymptotically valid uncertainty statements.
# 9. Debiased Lasso and High-Dimensional Inference
Chapters 2 through 4 developed the Lasso as an estimator: it gives sharp prediction and estimation bounds when the regression vector is sparse and the design satisfies restricted eigenvalue-type conditions. This chapter changes the question from estimation to inference. We want confidence intervals and tests for individual coordinates of $\beta^*$ even when $p$ may be much larger than $n$, where classical least-squares standard errors are unavailable.
The main obstruction is that the Lasso is deliberately biased. Its shrinkage stabilises estimation but distorts the distribution of each coordinate, and its data-dependent selected model cannot be treated as if it had been fixed before seeing the data. The debiased Lasso corrects the leading shrinkage bias by multiplying the score equation by an approximate inverse Gram matrix, built through nodewise regressions.
This construction also connects high-dimensional statistics with two wider themes. From semiparametric inference, it borrows the idea of an influence function: a biased preliminary estimator is corrected by adding an empirical score in a direction that is approximately orthogonal to nuisance errors. From graphical modelling, nodewise regression estimates rows of a precision matrix, so inference for a regression coefficient depends on the conditional dependence structure among covariates as well as on sparsity of the signal.
## Why Naive Lasso Confidence Intervals Fail
The first question is why the usual recipe, "estimate a coefficient and attach a standard error", breaks down for the Lasso. The answer has two parts: the $\ell_1$ penalty shrinks active coefficients toward zero, and the active set is itself a random object chosen by the same data used for inference. Both effects remain visible at the $n^{-1/2}$ scale unless they are corrected.
We work in the fixed-design linear model
\begin{align*}
Y = X\beta^* + \varepsilon,
\end{align*}
where $X \in \mathbb R^{n \times p}$ has columns normalised by $\|X_j\|_2^2/n = 1$, $\beta^* \in \mathbb R^p$ is sparse, and $\varepsilon \sim \mathcal N(0,\sigma^2 I_n)$ unless stated otherwise. The Lasso estimator is
\begin{align*}
\hat\beta \in \operatorname*{argmin}_{\beta \in \mathbb R^p}\left\{\frac{1}{2n}\|Y-X\beta\|_2^2 + \lambda \|\beta\|_1\right\}.
\end{align*}
The tuning parameter is usually chosen at the scale $\lambda \asymp \sigma\sqrt{\log p/n}$, which is large enough to dominate the random score $X^\top\varepsilon/n$ uniformly over $p$ coordinates.
The KKT equations expose the source of the bias. If $\hat z \in \partial \|\hat\beta\|_1$, then
\begin{align*}
\frac{1}{n}X^\top(Y-X\hat\beta) = \lambda \hat z.
\end{align*}
Substituting $Y=X\beta^*+\varepsilon$ gives
\begin{align*}
\hat\Sigma(\hat\beta-\beta^*) = \frac{1}{n}X^\top\varepsilon - \lambda \hat z, \qquad \hat\Sigma := \frac{1}{n}X^\top X.
\end{align*}
The stochastic score has the shape needed for asymptotic inference, but the penalty term $-\lambda\hat z$ is of the same order as the score in the sup norm. A coordinate of $\hat\beta-\beta^*$ is therefore not centred around zero in the way ordinary least squares is.
[example: Orthogonal Design Bias]
Assume $p\le n$ and $\hat\Sigma=X^\top X/n=I_p$. Put
\begin{align*}
a_j:=\frac{X_j^\top Y}{n}.
\end{align*}
Since $Y=X\beta^*+\varepsilon$ and $X_j^\top X_k/n=\mathbf 1_{\{j=k\}}$,
\begin{align*}
a_j
=\frac{X_j^\top X\beta^*}{n}+\frac{X_j^\top\varepsilon}{n}
=\sum_{k=1}^p\frac{X_j^\top X_k}{n}\beta_k^*+\frac{X_j^\top\varepsilon}{n}
=\beta_j^*+\frac{X_j^\top\varepsilon}{n}.
\end{align*}
The Lasso objective differs from
\begin{align*}
\sum_{j=1}^p\left\{\frac{1}{2}\beta_j^2-a_j\beta_j+\lambda|\beta_j|\right\}
=\sum_{j=1}^p\left\{\frac{1}{2}(\beta_j-a_j)^2+\lambda|\beta_j|\right\}
-\frac{1}{2}\sum_{j=1}^p a_j^2
\end{align*}
only by a constant independent of $\beta$, so it minimises separately in each coordinate. Hence
\begin{align*}
\hat\beta_j
=S_\lambda(a_j)
=S_\lambda\left(\beta_j^*+\frac{X_j^\top\varepsilon}{n}\right),
\qquad
S_\lambda(t)=\operatorname{sgn}(t)(|t|-\lambda)_+.
\end{align*}
Now suppose $\beta_j^*>\lambda$ and write $w_j:=X_j^\top\varepsilon/n$. On the event $|w_j|<\beta_j^*-\lambda$, we have
\begin{align*}
\beta_j^*+w_j>\beta_j^*-(\beta_j^*-\lambda)=\lambda,
\end{align*}
so the soft-thresholding formula is in its positive active branch:
\begin{align*}
\hat\beta_j
=S_\lambda(\beta_j^*+w_j)
=(\beta_j^*+w_j)-\lambda.
\end{align*}
Therefore
\begin{align*}
\hat\beta_j-\beta_j^*
=(\beta_j^*+w_j-\lambda)-\beta_j^*
=w_j-\lambda
=\frac{X_j^\top\varepsilon}{n}-\lambda.
\end{align*}
The stochastic term has ordinary standard deviation $\sigma/\sqrt n$ because $\|X_j\|_2^2/n=1$, while the deterministic shrinkage term is $\lambda$; at the usual choice $\lambda\asymp\sigma\sqrt{\log p/n}$, an interval centred at $\hat\beta_j$ is centred below the target by a penalty-size amount rather than by random noise alone.
[/example]
This example is deliberately favourable: the design is orthogonal, so there is no high-dimensional collinearity to blame. The failure is intrinsic to shrinkage. In correlated designs an additional complication appears, because the selected variables depend on noise correlations across the full design matrix.
[remark: Selection Is Not Conditioning On A Fixed Model]
After Lasso selection, it is tempting to refit ordinary least squares on $\hat S=\{j:\hat\beta_j\ne 0\}$ and use the usual low-dimensional confidence intervals. Those intervals condition on the event that the model was fixed before observing $Y$. Since $\hat S$ is a function of $Y$, the distribution of the refitted estimator is truncated and shifted by the selection event unless a selective-inference correction is made.
[/remark]
The debiased Lasso takes a different route from post-selection inference. It does not try to condition on the selected model. Instead, it constructs a new estimator whose leading term is an average of independent noise variables and whose remaining bias is small under sparsity assumptions.
## Nodewise Regression and Approximate Inverse Gram Matrices
To remove the Lasso bias, we would like to multiply the score equation by $\hat\Sigma^{-1}$. When $p>n$, the empirical Gram matrix $\hat\Sigma$ is singular, and even when $p<n$ its inverse may be unstable. The guiding problem is therefore to construct enough of an inverse to recover the coordinate directions needed for inference.
For a fixed coordinate $j$, write $X_j\in\mathbb R^n$ for the $j$th column and $X_{-j}$ for the matrix obtained by deleting that column. Nodewise regression regresses each column of $X$ on the remaining columns using the Lasso. This estimates the residual part of $X_j$ not explained by $X_{-j}$, and that residual becomes the instrument used to isolate $\beta_j^*$.
[definition: Nodewise Lasso Regression]
For each $j\in\{1,\dots,p\}$, the nodewise Lasso coefficient vector is any solution
\begin{align*}
\hat\gamma_j \in \operatorname*{argmin}_{\gamma\in\mathbb R^{p-1}}\left\{\frac{1}{2n}\|X_j-X_{-j}\gamma\|_2^2+\lambda_j\|\gamma\|_1\right\}.
\end{align*}
The nodewise residual and residual variance estimate are
\begin{align*}
\hat r_j := X_j-X_{-j}\hat\gamma_j, \qquad
\hat\tau_j^2 := \frac{1}{n}\|\hat r_j\|_2^2 + \lambda_j\|\hat\gamma_j\|_1.
\end{align*}
[/definition]
The nodewise regression produces residuals, but the debiasing formula needs row vectors that can multiply the Lasso score. We therefore normalise and embed each residual direction into a $p$-dimensional vector, chosen so that its product with the Gram matrix mimics a coordinate vector.
[definition: Nodewise Approximate Inverse]
Embed $\hat\gamma_j$ into a vector $\hat c_j\in\mathbb R^p$ by setting $(\hat c_j)_j=1$ and $(\hat c_j)_k=-(\hat\gamma_j)_k$ for $k\ne j$, with the natural indexing inherited from $X_{-j}$. The $j$th row of the nodewise approximate inverse is
\begin{align*}
\hat\theta_j^\top := \frac{\hat c_j^\top}{\hat\tau_j^2}.
\end{align*}
The matrix $\hat\Theta\in\mathbb R^{p\times p}$ has rows $\hat\theta_1^\top,\dots,\hat\theta_p^\top$.
[/definition]
The point of this construction is not to invert $\hat\Sigma$ exactly. It is to make $\hat\Theta\hat\Sigma$ close to $I_p$ in a norm strong enough to control sparse errors. Write $e_j\in\mathbb R^p$ for the $j$th standard basis vector. The next result records the deterministic KKT consequence that supplies this control.
[quotetheorem:5581]
[citeproof:5581]
This theorem is the algebraic bridge between sparse estimation and inference. The residual variance $\hat\tau_j^2$ matters because it is the normalising factor that turns the nodewise residual into an approximate inverse direction; if $X_j$ lies nearly in the span of $X_{-j}$, then $\hat r_j$ and $\hat\tau_j^2$ can be very small, so the row $\hat\theta_j$ becomes unstable and the bound gives little useful control. The nodewise tuning term is also necessary: with $\lambda_j=0$ in a singular high-dimensional design, the residual equations need not select a stable sparse direction, and the attempted inverse can have large coordinates. The result does not say that $\hat\Theta$ is a genuine inverse of $\hat\Sigma$, nor that every coordinate can be estimated well without structural assumptions on the design. What it gives is exactly the deterministic control needed later: if $\hat\beta-\beta^*$ is small in $\ell_1$ norm, then the sup-norm error in $\hat\Theta\hat\Sigma-I_p$ produces a small scalar bias for each coordinate.
[example: Precision Row Sparsity]
Let $Z=(Z_1,\dots,Z_p)$ denote a random design row with covariance $\Sigma=\mathbb E[ZZ^\top]$, and fix a coordinate $j$. The population regression coefficient of $Z_j$ on $Z_{-j}$ is
\begin{align*}
\gamma_j^{\mathrm{pop}}\in \operatorname*{argmin}_{\gamma\in\mathbb R^{p-1}}\mathbb E\left[(Z_j-Z_{-j}^\top\gamma)^2\right].
\end{align*}
Expanding the objective term by term gives
\begin{align*}
\mathbb E\left[(Z_j-Z_{-j}^\top\gamma)^2\right]=\mathbb E[Z_j^2]-2\gamma^\top\mathbb E[Z_{-j}Z_j]+\gamma^\top\mathbb E[Z_{-j}Z_{-j}^\top]\gamma.
\end{align*}
Using the block notation for $\Sigma$, this is
\begin{align*}
\mathbb E\left[(Z_j-Z_{-j}^\top\gamma)^2\right]=\Sigma_{jj}-2\gamma^\top\Sigma_{-j,j}+\gamma^\top\Sigma_{-j,-j}\gamma.
\end{align*}
Differentiating this quadratic in $\gamma$ gives
\begin{align*}
-2\Sigma_{-j,j}+2\Sigma_{-j,-j}\gamma_j^{\mathrm{pop}}=0,
\end{align*}
and therefore
\begin{align*}
\Sigma_{-j,-j}\gamma_j^{\mathrm{pop}}=\Sigma_{-j,j}.
\end{align*}
Now write $\Theta=\Sigma^{-1}$. Since $\Sigma\Theta=I_p$, the off-diagonal block of the $j$th column identity is
\begin{align*}
\Sigma_{-j,j}\Theta_{jj}+\Sigma_{-j,-j}\Theta_{-j,j}=0.
\end{align*}
Because $\Theta_{jj}\ne0$, subtracting $\Sigma_{-j,-j}\Theta_{-j,j}$ from both sides and dividing by $\Theta_{jj}$ gives
\begin{align*}
\Sigma_{-j,j}=-\Sigma_{-j,-j}\frac{\Theta_{-j,j}}{\Theta_{jj}}.
\end{align*}
Equivalently,
\begin{align*}
\Sigma_{-j,-j}\left(-\frac{\Theta_{-j,j}}{\Theta_{jj}}\right)=\Sigma_{-j,j}.
\end{align*}
Comparing this equation with the population normal equation yields
\begin{align*}
\gamma_j^{\mathrm{pop}}=-\frac{\Theta_{-j,j}}{\Theta_{jj}}.
\end{align*}
Thus the population regression of $Z_j$ on $Z_{-j}$ has sparse coefficients exactly when the off-diagonal part of the $j$th precision column is sparse.
Embed this population regression coefficient into $c_j^{\mathrm{pop}}\in\mathbb R^p$ by setting $(c_j^{\mathrm{pop}})_j=1$ and $(c_j^{\mathrm{pop}})_k=-(\gamma_j^{\mathrm{pop}})_k$ for $k\ne j$, with the natural indexing inherited from $Z_{-j}$. The identity for $\gamma_j^{\mathrm{pop}}$ gives, coordinate by coordinate,
\begin{align*}
(c_j^{\mathrm{pop}})_j=1=\frac{\Theta_{jj}}{\Theta_{jj}}.
\end{align*}
For $k\ne j$,
\begin{align*}
(c_j^{\mathrm{pop}})_k=-(\gamma_j^{\mathrm{pop}})_k=\frac{\Theta_{kj}}{\Theta_{jj}}.
\end{align*}
Hence
\begin{align*}
c_j^{\mathrm{pop}}=\frac{\Theta_{\cdot j}}{\Theta_{jj}}.
\end{align*}
So nodewise Lasso is estimating a sparse population regression whose embedded direction is a rescaled precision column. This is why sparsity of the precision rows is the structural condition that lets the empirical nodewise vector $\hat\theta_j$ isolate the coordinate $\beta_j^*$ even when $p\gg n$.
[/example]
The sparsity required for inference is therefore not only sparsity of $\beta^*$. It also asks that the covariate whose coefficient we want to test can be approximately separated from the other covariates by a sparse regression.
## Debiased Estimator and Its Expansion
We now ask how to turn the approximate inverse into an estimator with a tractable distribution. The KKT identity shows that the Lasso error is distorted by the Gram matrix and by the penalty. Applying $\hat\Theta X^\top/n$ to the residual supplies the correction term.
[definition: Debiased Lasso Estimator]
Given observed data $(X,Y)\in\mathbb R^{n\times p}\times\mathbb R^n$, a Lasso fit $\hat\beta\in\mathbb R^p$, and an approximate inverse matrix $\hat\Theta\in\mathbb R^{p\times p}$, the debiasing map
\begin{align*}
D_{\mathrm{db}}:\mathbb R^{n\times p}\times\mathbb R^n\times\mathbb R^p\times\mathbb R^{p\times p}\to\mathbb R^p
\end{align*}
is defined by
\begin{align*}
D_{\mathrm{db}}(X,Y,\hat\beta,\hat\Theta)
:= \hat\beta + \hat\Theta\frac{X^\top(Y-X\hat\beta)}{n}.
\end{align*}
The debiased Lasso estimator is $\hat b:=D_{\mathrm{db}}(X,Y,\hat\beta,\hat\Theta)$.
Its $j$th coordinate is
\begin{align*}
\hat b_j = \hat\beta_j + \hat\theta_j^\top\frac{X^\top(Y-X\hat\beta)}{n}.
\end{align*}
[/definition]
The correction adds back the empirical score evaluated at the biased Lasso fit. The inferential question is whether this added score cancels the Lasso shrinkage bias or merely rewrites it. An exact decomposition isolates the random score that should drive a confidence interval from the deterministic remainder that must later be shown negligible.
[quotetheorem:5582]
[citeproof:5582]
The expansion separates the estimator into a leading score term and a remainder, but at this point it is only an algebraic identity. It does not by itself imply asymptotic normality, because the remainder may still be of the same order as the score if the approximate inverse is poor or the Lasso error is too large. For example, taking $\hat\Theta=0$ makes the stochastic score vanish and leaves $\hat b=\hat\beta$, so the original Lasso shrinkage bias remains. Likewise, if $\hat\beta$ has large $\ell_1$ error, a small average prediction error is not enough to control the coordinate remainder. Nor does the expansion provide valid confidence intervals, since the variance of the score and the size of the bias must still be controlled. The next estimate gives the deterministic inequality that turns the expansion into an inferential statement under sparsity assumptions.
[quotetheorem:5583]
[citeproof:5583]
This is the core high-dimensional inference condition. The sup-norm approximate inverse bound is needed because the debiasing error is tested against an $\ell_1$ estimation error: if one coordinate of $e_j^\top(I_p-\hat\Theta\hat\Sigma)$ is of constant size and the Lasso error is concentrated on that coordinate, the scalar remainder can be of the same order as the estimation error rather than the smaller product rate. The Lasso $\ell_1$ rate is equally important: prediction consistency alone would not control the inner product in the remainder term, since two coefficient vectors can have similar fitted values under a correlated design while differing substantially in $\ell_1$ norm. The lower bound on $\hat\tau_j^2$ rules out near-collinearity of $X_j$ with the remaining columns; without it, the normalising denominator can amplify nodewise noise. The condition $s\log p/\sqrt n\to0$ says that the remaining bias is smaller than the $n^{-1/2}$ stochastic fluctuation; if it fails, the normal score may be centred at the wrong value and confidence intervals can undercover. Estimation alone usually needs only $s\log p/n\to0$, so inference imposes a genuinely stronger sparsity regime. This remainder estimate combines the Chapter 4 $\ell_1$ Lasso rate with the nodewise inverse bound, and is the final deterministic input for the asymptotic normality theorem.
## Asymptotic Normality and Confidence Intervals
The remaining question is how to convert the expansion into confidence intervals. Once the remainder is negligible, the coordinate error is governed by $\hat\theta_j^\top X^\top\varepsilon/n$, a weighted sum of the noise variables.
For fixed design and Gaussian noise, conditional on $X$,
\begin{align*}
\hat\theta_j^\top\frac{X^\top\varepsilon}{n}
= \frac{1}{n}(X\hat\theta_j)^\top\varepsilon
\sim \mathcal N\left(0,\frac{\sigma^2}{n}\hat\theta_j^\top\hat\Sigma\hat\theta_j\right).
\end{align*}
The nodewise construction gives a computable variance factor. Since the noise level is usually unknown, the inferential statistic needs a plug-in standard error combining the estimated variance factor with an estimator of $\sigma$.
[definition: Debiased Standard Error]
Let $\hat\sigma\in\mathbb R_+$ be an estimator of the noise level $\sigma$. For each coordinate $j$, the standard-error map
\begin{align*}
S_j:\mathbb R_+\times\mathbb R^{p\times p}\times\mathbb R^{p\times p}\to\mathbb R_+
\end{align*}
is defined by
\begin{align*}
S_j(\hat\sigma,\hat\Theta,\hat\Sigma)
:= \hat\sigma\left(\frac{\hat\theta_j^\top\hat\Sigma\hat\theta_j}{n}\right)^{1/2}.
\end{align*}
The debiased standard error is $\widehat{\operatorname{se}}_j:=S_j(\hat\sigma,\hat\Theta,\hat\Sigma)$.
[/definition]
The standard error formula uses the empirical variance of the score direction $X\hat\theta_j$. It is useful only if the normal approximation survives the plug-in estimates, so the next result collects the sparsity and stability assumptions under which the normalised coordinate statistic has a standard normal limit.
[quotetheorem:5584]
[citeproof:5584]
The theorem justifies using normal quantiles for a fixed coordinate, but each assumption has a distinct role. The restricted eigenvalue and $\ell_1$ oracle assumptions control the initial Lasso error; when they fail, for instance under nearly duplicated columns, the Lasso can move mass between correlated variables and leave a large coordinate-level bias. The nodewise inverse and residual-variance assumptions control the debiasing direction; if $X_j$ is almost a linear combination of many other columns, the variance factor can inflate and the approximate inverse row no longer isolates the $j$th coordinate. The nondegenerate variance condition rules out a statistic normalised by a vanishing or exploding scale. The sparsity condition $s_n\log p_n/\sqrt n\to0$ is stronger than the estimation condition because inference needs the bias to be negligible relative to the standard error, not merely smaller than the signal in prediction norm.
The conclusion is asymptotic and fixed-coordinate. It also depends on consistent noise estimation when $\sigma$ is unknown; a poor estimate of $\sigma$ rescales the statistic and invalidates the nominal coverage. The result does not justify selecting the coordinate after looking at the largest statistic, nor does it give simultaneous coverage over many coordinates without multiplicity correction. With a normal limit and a feasible standard error in hand for a pre-specified coordinate, the natural next object is the interval obtained by centring at the debiased coordinate and using Gaussian critical values.
[definition: Debiased Lasso Confidence Interval]
For nominal level $1-\alpha$ with $\alpha\in(0,1)$, define the interval-valued map
\begin{align*}
C_j:\mathbb R\times\mathbb R_+\times(0,1)\to\{[a,b]\subset\mathbb R:a\le b\}
\end{align*}
by
\begin{align*}
C_j(\hat b_j,\widehat{\operatorname{se}}_j,\alpha)
:=\left[\hat b_j-z_{1-\alpha/2}\widehat{\operatorname{se}}_j,\,\hat b_j+z_{1-\alpha/2}\widehat{\operatorname{se}}_j\right],
\end{align*}
where $z_{1-\alpha/2}$ is the $(1-\alpha/2)$ quantile of $\mathcal N(0,1)$. The two-sided debiased Lasso confidence interval for $\beta_j^*$ is $\operatorname{CI}_j(1-\alpha):=C_j(\hat b_j,\widehat{\operatorname{se}}_j,\alpha)$.
[/definition]
The interval is centred at $\hat b_j$, not at the original Lasso coordinate. This distinction matters most when the original Lasso estimate has been shrunk to zero: a variable can have $\hat\beta_j=0$ while the debiased estimate produces a nonzero test statistic. Write $\Phi$ for the distribution function of a standard normal random variable.
[example: Testing A Single Covariate]
To test $H_0:\beta_j^*=0$ against $H_1:\beta_j^*\ne0$, compute the debiased statistic
\begin{align*}
T_j := \frac{\hat b_j}{\widehat{\operatorname{se}}_j}.
\end{align*}
Under $H_0$, $\beta_j^*=0$, so
\begin{align*}
T_j=\frac{\hat b_j-\beta_j^*}{\widehat{\operatorname{se}}_j}.
\end{align*}
By the coordinate asymptotic normality result, with a consistent plug-in estimate of $\sigma$ in $\widehat{\operatorname{se}}_j$,
\begin{align*}
\frac{\hat b_j-\beta_j^*}{\widehat{\operatorname{se}}_j}\xrightarrow{d}\mathcal N(0,1).
\end{align*}
Thus, under the null, $T_j$ is asymptotically standard normal.
For an observed value $t_j$, let $Z\sim\mathcal N(0,1)$. The two-sided normal tail probability is
\begin{align*}
\mathbb P(|Z|\ge |t_j|)=\mathbb P(Z\ge |t_j|)+\mathbb P(Z\le -|t_j|).
\end{align*}
Using $\mathbb P(Z\ge x)=1-\Phi(x)$ and the symmetry identity $\Phi(-x)=1-\Phi(x)$ for $x\ge0$, this becomes
\begin{align*}
\mathbb P(|Z|\ge |t_j|)=2\left(1-\Phi(|t_j|)\right).
\end{align*}
Therefore the two-sided asymptotic $p$-value is
\begin{align*}
p_j=2\left(1-\Phi(|T_j|)\right).
\end{align*}
Since $z_{1-\alpha/2}$ is defined by $\Phi(z_{1-\alpha/2})=1-\alpha/2$, the inequality $p_j<\alpha$ is equivalent to
\begin{align*}
|T_j|>z_{1-\alpha/2}.
\end{align*}
So the level-$\alpha$ two-sided test rejects $H_0$ when $|T_j|>z_{1-\alpha/2}$. This test targets the population coefficient $\beta_j^*$ in the full high-dimensional model, not the coefficient in the selected submodel.
[/example]
The last sentence is a key interpretive point. Debiased inference answers a full-model coordinate question; it is not a guarantee that the selected model is the correct explanatory subset.
## Confidence Intervals After Variable Selection
A frequent applied workflow first runs the Lasso to select variables and then asks for intervals on the selected coefficients. The debiased Lasso can be used after this workflow, but the target and the validity statement must be stated with care. The interval remains an interval for the coordinate of the original full parameter $\beta^*$, not for a refitted projection parameter inside the selected model.
[example: Interval For A Selected Variable]
Suppose the Lasso selects $\hat S=\{k:\hat\beta_k\ne0\}$ and the reported coordinate satisfies $j\in\hat S$. A naive post-selection interval first refits least squares on the selected columns,
\begin{align*}
\tilde\beta_{\hat S}\in \operatorname*{argmin}_{u\in\mathbb R^{|\hat S|}}\|Y-X_{\hat S}u\|_2^2,
\end{align*}
and then reports the fixed-model least-squares standard error for the $j$th selected coefficient. The difficulty is that $\hat S$ is itself a function of $Y$, so this standard error treats a random selected model as if it had been fixed before the data were observed.
The debiased interval for the same reported coordinate does not refit on $X_{\hat S}$. It uses the full-design correction
\begin{align*}
\hat b_j=\hat\beta_j+\hat\theta_j^\top\frac{X^\top(Y-X\hat\beta)}{n},
\end{align*}
where $\hat\theta_j$ is the nodewise direction built from all columns of $X$, and reports
\begin{align*}
\operatorname{CI}_j(1-\alpha)=\left[\hat b_j-z_{1-\alpha/2}\widehat{\operatorname{se}}_j,\hat b_j+z_{1-\alpha/2}\widehat{\operatorname{se}}_j\right].
\end{align*}
The interval covers the full-model coordinate $\beta_j^*$ exactly when
\begin{align*}
\hat b_j-z_{1-\alpha/2}\widehat{\operatorname{se}}_j\le \beta_j^*\le \hat b_j+z_{1-\alpha/2}\widehat{\operatorname{se}}_j.
\end{align*}
Subtracting $\hat b_j$ throughout and multiplying by $-1$ reverses the two inequalities, giving
\begin{align*}
-z_{1-\alpha/2}\widehat{\operatorname{se}}_j\le \hat b_j-\beta_j^*\le z_{1-\alpha/2}\widehat{\operatorname{se}}_j.
\end{align*}
Since $\widehat{\operatorname{se}}_j>0$, division by $\widehat{\operatorname{se}}_j$ gives
\begin{align*}
-z_{1-\alpha/2}\le \frac{\hat b_j-\beta_j^*}{\widehat{\operatorname{se}}_j}\le z_{1-\alpha/2}.
\end{align*}
Equivalently,
\begin{align*}
\left|\frac{\hat b_j-\beta_j^*}{\widehat{\operatorname{se}}_j}\right|\le z_{1-\alpha/2}.
\end{align*}
For a fixed coordinate $j$, *Coordinate Asymptotic Normality* gives
\begin{align*}
\frac{\hat b_j-\beta_j^*}{\widehat{\operatorname{se}}_j}\xrightarrow{d}\mathcal N(0,1),
\end{align*}
under the stated high-dimensional sparsity, nodewise stability, and consistent noise-estimation assumptions. If $Z\sim\mathcal N(0,1)$, the quantile definition gives $\mathbb P(-z_{1-\alpha/2}\le Z\le z_{1-\alpha/2})=1-\alpha$, so the limiting coverage for this fixed full-model coordinate is $1-\alpha$. Thus the interval is interpreted as an interval for $\beta_j^*$ in the original high-dimensional model, even though the coordinate was chosen from the Lasso active set.
[/example]
Selection can still influence which intervals are reported. If a practitioner reports intervals only for selected variables, the individual interval for a fixed coordinate has the asymptotic coverage described above, but simultaneous or selection-conditional claims require additional correction.
[remark: Fixed Coordinate Versus Data-Dependent Reporting]
The coordinate asymptotic normality theorem is stated for a fixed $j$. Reporting the most significant debiased statistic, or reporting all variables selected by the Lasso without multiplicity control, changes the inferential question. Familywise error, false discovery rate, or selective coverage statements require procedures beyond the single-coordinate result.
[/remark]
The chapter's main lesson is that high-dimensional inference is possible only after separating estimation from bias correction; Chapter 10 turns these requirements into tuning, noise-estimation, and diagnostic choices for practice. The Lasso supplies a sparse initial estimator with small $\ell_1$ error; nodewise regression supplies approximate inverse directions; the debiased estimator combines them so that the leading term is a normalised score and the remainder is of smaller order.
Debiasing completes the inferential picture, but it also makes clear that theory must be translated into practical choices of tuning, noise estimation, and diagnostics. The final chapter turns those abstract rates and remainder bounds into a workable procedure for data analysis, using the earlier chapters as the mathematical guide.
# 10. Tuning, Noise Estimation, and Practical Diagnostics
This chapter turns the finite-sample theory of sparse estimators into a workflow for data analysis. Chapters 1 through 4 treated the penalty level as a mathematical input, usually chosen large enough to dominate the empirical score. The prerequisites are the fixed-design linear model from Chapter 1, the KKT conditions from Chapter 2, the restricted-eigenvalue viewpoint from Chapter 3, and the Lasso oracle inequalities from Chapter 4. In practice the noise level is unknown, prediction and inference ask for different tuning behaviour, and the fitted active set changes as the penalty varies. The aim is to connect theoretical tuning, cross-validation, noise estimation, and diagnostics without losing sight of the events on which the Lasso theory rests.
## Choosing the Penalty Level
The first practical question is how to choose the regularisation parameter when the theory says that it should be comparable to an unknown stochastic quantity. For the linear model $Y = X\beta^* + \varepsilon$ with fixed design and columns normalised by $\|X_j\|_2^2/n = 1$, the basic Lasso theory asks for
\begin{align*}
\left\|\frac{X^\top \varepsilon}{n}\right\|_\infty \leq \frac{\lambda}{2}.
\end{align*}
When $\varepsilon \sim \mathcal N(0,\sigma^2 I_n)$, this suggests $\lambda$ of order $\sigma\sqrt{\log p/n}$, but it leaves two practical issues: $\sigma$ may be unknown, and a penalty chosen for prediction need not be suitable for variable selection or inference.
[definition: K Fold Cross-Validation Score]
Let $K \geq 2$ and let $\{I_1,\dots,I_K\}$ be a partition of $\{1,\dots,n\}$ into nonempty folds. For each $\lambda > 0$ and each $k$, let $\hat\beta_{-k}(\lambda)\in\mathbb R^p$ be a specified solution of the Lasso problem fitted with observations outside $I_k$, with a fixed deterministic tie-breaking rule if the solution is not unique. The $K$-fold cross-validation score is the map $\operatorname{CV}:(0,\infty)\to\mathbb R_{\geq 0}$ defined by
\begin{align*}
\operatorname{CV}(\lambda) := \frac{1}{n}\sum_{k=1}^K \|Y_{I_k} - X_{I_k}\hat\beta_{-k}(\lambda)\|_2^2.
\end{align*}
[/definition]
Cross-validation estimates out-of-sample prediction error, so it answers a prediction question rather than a support-recovery question. This distinction matters because the prediction-optimal penalty can be smaller than the penalty required to control false inclusions or to place the Lasso on the usual oracle-inequality event.
[example: Cross-Validation Undersmoothing For Inference]
Consider one active column $X_a$ and one null column $X_0$, both normalised by $\|X_a\|_2^2/n=\|X_0\|_2^2/n=1$, with empirical correlation $X_a^\top X_0/n=\rho$ close to $1$. If the true contribution is $bX_a$ and a fitted model moves an amount $t$ of coefficient from the active column to the null column, its fitted signal becomes $(b-t)X_a+tX_0$. The change in fitted values relative to the true signal $bX_a$ is
\begin{align*}
(b-t)X_a+tX_0-bX_a = t(X_0-X_a).
\end{align*}
The average squared size of this change is
\begin{align*}
\frac{1}{n}\|t(X_0-X_a)\|_2^2 = \frac{t^2}{n}(X_0-X_a)^\top(X_0-X_a).
\end{align*}
Expanding the quadratic form gives
\begin{align*}
(X_0-X_a)^\top(X_0-X_a)=X_0^\top X_0-2X_a^\top X_0+X_a^\top X_a.
\end{align*}
Using $X_0^\top X_0/n=1$, $X_a^\top X_a/n=1$, and $X_a^\top X_0/n=\rho$, we obtain
\begin{align*}
\frac{1}{n}\|t(X_0-X_a)\|_2^2=t^2(1-2\rho+1)=2t^2(1-\rho).
\end{align*}
Thus, when $\rho$ is close to $1$, assigning coefficient $t$ to the null covariate can change prediction error only by a small amount.
Along a decreasing Lasso penalty path, cross-validation may therefore prefer a value of $\lambda$ small enough to include $X_0$, because the prediction cost $2t^2(1-\rho)$ can be small compared with the sampling noise in the validation score. The selected model can still be too large for naive post-selection confidence intervals: those intervals condition neither on the random choice of $\lambda$ nor on the event that the null covariate entered. A larger theory-driven penalty, debiasing, or sample splitting targets an inferential error criterion rather than only held-out prediction error.
[/example]
The example shows why predictive tuning alone does not supply the probability event used in the Lasso proof. The next theorem gives a calibratable penalty level that makes the empirical score small with prescribed probability, so it provides a benchmark against which data-driven choices can be compared.
[quotetheorem:5585]
[citeproof:5585]
The tuning rule is a useful theoretical anchor because each hypothesis has a visible role. Gaussianity gives the exponential coordinate tail; without it, a heavy-tailed error distribution can make $\|X^\top\varepsilon/n\|_\infty$ much larger than the displayed threshold with non-negligible probability. The column normalisation fixes the variance scale of each score coordinate; if a column has norm far larger than $\sqrt n$, the same coefficient of $\sigma\sqrt{\log p/n}$ no longer controls that coordinate. The theorem controls only the score event, not support recovery, restricted eigenvalues, or prediction error by itself. It also exposes a new problem: the penalty cannot be evaluated unless $\sigma$ is already known or estimated. This motivates a modified optimisation criterion whose penalty scale is dimensionless from the start.
[definition: Square Root Lasso]
Fix $X\in\mathbb R^{n\times p}$ and $\lambda_{\sqrt{}} >0$. The square-root Lasso solution correspondence is the set-valued map $\widehat{\mathcal B}_{\mathrm{sqrt}}:\mathbb R^n\to 2^{\mathbb R^p}$ defined by
\begin{align*}
\widehat{\mathcal B}_{\mathrm{sqrt}}(Y) := \operatorname*{argmin}_{\beta\in\mathbb R^p}\left\{\frac{\|Y-X\beta\|_2}{\sqrt n}+\lambda_{\sqrt{}}\|\beta\|_1\right\}.
\end{align*}
A square-root Lasso estimator is any map $\hat\beta_{\mathrm{sqrt}}:\mathbb R^n\to\mathbb R^p$ such that $\hat\beta_{\mathrm{sqrt}}(Y)\in\widehat{\mathcal B}_{\mathrm{sqrt}}(Y)$ for every $Y$ in its domain, with a specified tie-breaking rule when the solution is not unique.
[/definition]
The loss is homogeneous in the residual norm rather than its square. This raises the question of whether the score appearing in the KKT conditions can be calibrated without first estimating the noise level.
[quotetheorem:5586]
[citeproof:5586]
The pivotal theorem explains why the same numerical penalty can be reused after rescaling the response. The assumption $\varepsilon=\sigma z$ is what permits the cancellation of the unknown scale; if the noise is heteroscedastic, the numerator and denominator no longer share a single common factor. Gaussianity is also used for the concentration statement, so the displayed order is not a distribution-free guarantee. The theorem identifies the penalty scale for the score event, but it does not by itself verify restricted eigenvalues, rule out model misspecification, or guarantee exact variable selection. The next example makes the scale invariance visible at the level of the optimisation problem.
[example: Square Root Lasso Scale Invariance]
Suppose $c>0$, $Y'=cY$, and $X$ is unchanged. For a response vector $Z$, write the square-root Lasso objective as
\begin{align*}
Q_Z(\beta)=\frac{\|Z-X\beta\|_2}{\sqrt n}+\lambda_{\sqrt{}}\|\beta\|_1.
\end{align*}
For any $\beta\in\mathbb R^p$, the objective at the scaled coefficient $c\beta$ is
\begin{align*}
Q_{cY}(c\beta)=\frac{\|cY-X(c\beta)\|_2}{\sqrt n}+\lambda_{\sqrt{}}\|c\beta\|_1.
\end{align*}
Since $X(c\beta)=cX\beta$, this becomes
\begin{align*}
Q_{cY}(c\beta)=\frac{\|c(Y-X\beta)\|_2}{\sqrt n}+\lambda_{\sqrt{}}\|c\beta\|_1.
\end{align*}
Because $c>0$, the Euclidean and $\ell^1$ norms are positively homogeneous, so $\|c(Y-X\beta)\|_2=c\|Y-X\beta\|_2$ and $\|c\beta\|_1=c\|\beta\|_1$. Therefore
\begin{align*}
Q_{cY}(c\beta)=c\frac{\|Y-X\beta\|_2}{\sqrt n}+c\lambda_{\sqrt{}}\|\beta\|_1.
\end{align*}
Factoring out $c$ gives
\begin{align*}
Q_{cY}(c\beta)=cQ_Y(\beta).
\end{align*}
Now assume $\hat\beta_{\mathrm{sqrt}}$ minimises $Q_Y$. To prove that $c\hat\beta_{\mathrm{sqrt}}$ minimises $Q_{cY}$, take any $\theta\in\mathbb R^p$ and set $\gamma=\theta/c$, so that $\theta=c\gamma$. The identity above gives
\begin{align*}
Q_{cY}(c\hat\beta_{\mathrm{sqrt}})=cQ_Y(\hat\beta_{\mathrm{sqrt}}).
\end{align*}
Since $\hat\beta_{\mathrm{sqrt}}$ minimises $Q_Y$, we have
\begin{align*}
cQ_Y(\hat\beta_{\mathrm{sqrt}})\leq cQ_Y(\gamma).
\end{align*}
Applying the scaling identity again to $\gamma$ gives
\begin{align*}
cQ_Y(\gamma)=Q_{cY}(c\gamma)=Q_{cY}(\theta).
\end{align*}
Hence $Q_{cY}(c\hat\beta_{\mathrm{sqrt}})\leq Q_{cY}(\theta)$ for every $\theta\in\mathbb R^p$, so $c\hat\beta_{\mathrm{sqrt}}$ solves the square-root Lasso problem for $Y'=cY$ with the same $\lambda_{\sqrt{}}$. Multiplying the response by $c$ multiplies the whole objective by $c$, which is why the tuning parameter is not rescaled with the unknown noise level.
[/example]
## Estimating Noise and Checking Residuals
Once an estimator has been fitted, the next question is whether the residuals can be used to estimate the unknown noise level and to diagnose model mismatch. In low-dimensional regression the residual variance estimator divides by $n-p$, but in high dimension the active set is data-dependent and may be unstable. Noise estimation therefore has to account for shrinkage, selection, and the fact that $p$ may exceed $n$.
[definition: Residual Noise Estimate]
Fix $d>0$. Given an estimator map $\hat\beta:\mathbb R^n\to\mathbb R^p$ for a fixed design $X\in\mathbb R^{n\times p}$, define the residual map $\hat r:\mathbb R^n\to\mathbb R^n$ by $\hat r(Y)=Y-X\hat\beta(Y)$. A residual noise estimate is a statistic $\hat\sigma_d:\mathbb R^n\to\mathbb R_{\geq 0}$ of the form
\begin{align*}
\hat\sigma_d(Y) = \frac{\|\hat r(Y)\|_2}{\sqrt{d}}.
\end{align*}
[/definition]
The choice of $d$ is part of the statistical procedure. Using $d=n$ is stable but can underestimate $\sigma$ when the fit is too adaptive; using $d=n-|\hat S|$ mimics ordinary least squares but can be unreliable when the selected set $\hat S$ is large or selected by a small penalty. This motivates an estimator that updates the coefficient vector and the noise scale together rather than treating the residual degrees of freedom as fixed.
[definition: Scaled Lasso]
Fix $X\in\mathbb R^{n\times p}$ and $\lambda_0>0$. The scaled Lasso solution correspondence is the set-valued map $\widehat{\mathcal C}_{\mathrm{scaled}}:\mathbb R^n\to 2^{\mathbb R^p\times\mathbb R_{>0}}$ defined by
\begin{align*}
\widehat{\mathcal C}_{\mathrm{scaled}}(Y) := \operatorname*{argmin}_{\beta\in\mathbb R^p,\,\sigma>0}\left\{\frac{\|Y-X\beta\|_2^2}{2n\sigma}+\frac{\sigma}{2}+\lambda_0\|\beta\|_1\right\}.
\end{align*}
The scaled Lasso estimator is any map $(\hat\beta,\hat\sigma):\mathbb R^n\to\mathbb R^p\times\mathbb R_{>0}$ such that $(\hat\beta(Y),\hat\sigma(Y))\in\widehat{\mathcal C}_{\mathrm{scaled}}(Y)$ whenever the minimum is attained, with a specified tie-breaking rule if several minimisers exist.
[/definition]
The scaled Lasso turns noise estimation and coefficient estimation into a fixed-point problem. The difficulty is that the penalty level depends on the unknown scale, while the scale estimate depends on the residuals produced by the penalised fit. The optimality equations below expose this circular dependence as two coupled updates: a Lasso step at scale $\hat\sigma\lambda_0$ and a residual-norm update for $\hat\sigma$.
[quotetheorem:5587]
[citeproof:5587]
The fixed-point equations give a scale estimate tied to the chosen penalty, but their hypotheses matter. Positivity of $\sigma$ is needed because the objective contains $1/\sigma$; when the residual can be made exactly zero, the optimisation problem can degenerate at the boundary instead of producing a meaningful positive noise estimate. Convexity in each block explains the alternating-update interpretation, but the fixed-point equations do not certify that the selected variables are correct or that the residuals resemble the original noise. A small residual norm by itself is therefore not evidence that the model is correct. Residual diagnostics are needed to check for gross failures: nonlinear signal, heteroscedasticity, heavy tails, influential observations, and correlations left in the residuals.
[example: Residual Plot After Sparse Fitting]
Fit the Lasso over a grid $\Lambda\subset(0,\infty)$, choose $\hat\lambda$ by cross-validation, and write
\begin{align*}
\hat\beta=\hat\beta(\hat\lambda),\qquad \hat y=X\hat\beta,\qquad \hat r=Y-X\hat\beta .
\end{align*}
For observation $i$, this means
\begin{align*}
\hat y_i=(X\hat\beta)_i=\sum_{j=1}^p X_{ij}\hat\beta_j,\qquad
\hat r_i=Y_i-\hat y_i
=Y_i-\sum_{j=1}^p X_{ij}\hat\beta_j .
\end{align*}
Plot the pairs $(\hat y_i,\hat r_i)$, and also plot $(X_{ij},\hat r_i)$ for covariates $j$ whose observed values contain high-leverage or extreme points.
If the linear model with constant noise variance were a good approximation, then after fitting we would expect the residual cloud to have roughly comparable vertical spread across the range of fitted values. A funnel pattern means that, for two fitted-value regions $A$ and $B$, the empirical residual scales differ, for example
\begin{align*}
\frac{1}{|A|}\sum_{i:\hat y_i\in A}\hat r_i^2
\neq
\frac{1}{|B|}\sum_{i:\hat y_i\in B}\hat r_i^2,
\end{align*}
so a single constant variance level is not describing both regions equally well. A curved residual trend means that the conditional average residual changes with the fitted value; in binned form this appears as
\begin{align*}
\frac{1}{|A|}\sum_{i:\hat y_i\in A}\hat r_i
\end{align*}
being systematically positive in some fitted-value ranges and systematically negative in others. That is evidence that the linear mean $X\beta$ is missing structure, such as a nonlinear term or an interaction.
Lowering $\lambda$ can decrease the training residual norm because the optimisation is allowed to use a larger coefficient vector, but it does not by itself make the residual variance constant or remove a curved conditional pattern. The diagnostic is therefore about the modelling assumptions behind the oracle inequalities, not merely about whether the in-sample squared error can be made smaller.
[/example]
A complementary diagnostic is to ask whether a selected variable would still appear after random perturbations of the sample. Stability selection formalises this idea by subsampling and recording selection frequencies.
[definition: Stability Selection]
Fix a base selection method $A_\lambda:2^{\{1,\dots,n\}}\to 2^{\{1,\dots,p\}}$, and write $\hat S_\lambda(I)=A_\lambda(I)$ for the variables selected from observations indexed by $I$. For subsamples $I_b\subset\{1,\dots,n\}$ and penalty $\lambda$, the selection probability estimate for variable $j$ is the map $\hat\pi_j:(0,\infty)\to[0,1]$ defined by
\begin{align*}
\hat\pi_j(\lambda) := \frac{1}{B}\sum_{b=1}^B \mathbb{1}_{\{j\in\hat S_\lambda(I_b)\}}.
\end{align*}
The stability-selection output is the map $\hat S_{\mathrm{stab}}:(0,\infty)\times(1/2,1)\to 2^{\{1,\dots,p\}}$ defined by
\begin{align*}
\hat S_{\mathrm{stab}}(\lambda,\pi_{\mathrm{thr}}) := \{j\in\{1,\dots,p\}: \hat\pi_j(\lambda)\geq \pi_{\mathrm{thr}}\}.
\end{align*}
[/definition]
The selection-frequency threshold shifts the target from a single fitted active set to variables that repeatedly survive perturbation. The next theorem explains why this threshold can be interpreted as false-discovery control once exchangeability of null variables and a bound on the base method's model size are imposed.
[quotetheorem:5588]
[citeproof:5588]
The theorem is most useful when its assumptions are treated as modelling assumptions rather than automatic facts. Correlated groups of variables often violate the spirit of null exchangeability, and the following example shows how stability frequencies should then be read.
[example: Stability Frequencies In A Correlated Block]
Suppose the correlated block is $\{1,2,3,4,5\}$ and only variable $1$ carries signal. For a fixed penalty $\lambda$, write $N_j=\sum_{b=1}^B \mathbb{1}_{\{j\in \hat S_\lambda(I_b)\}}$, so the stability frequency is
\begin{align*}
\hat\pi_j(\lambda)=\frac{N_j}{B}.
\end{align*}
If, for example, the Lasso selects variables from the block across $B=100$ subsamples with counts
\begin{align*}
N_1=38,\qquad N_2=27,\qquad N_3=21,\qquad N_4=10,\qquad N_5=4,
\end{align*}
then the corresponding frequencies are
\begin{align*}
\hat\pi_1(\lambda)=\frac{38}{100}=0.38,\quad
\hat\pi_2(\lambda)=\frac{27}{100}=0.27,\quad
\hat\pi_3(\lambda)=\frac{21}{100}=0.21,\quad
\hat\pi_4(\lambda)=\frac{10}{100}=0.10,\quad
\hat\pi_5(\lambda)=\frac{4}{100}=0.04.
\end{align*}
The total block frequency is
\begin{align*}
\sum_{j=1}^5 \hat\pi_j(\lambda)
=
\frac{38+27+21+10+4}{100}
=
\frac{100}{100}
=
1,
\end{align*}
so the procedure selects one member of the block on average in each subsample, but no single member has frequency near $1$.
For a stability threshold such as $\pi_{\mathrm{thr}}=0.7$, the stable selected set inside the block is empty because
\begin{align*}
0.38<0.7,\quad 0.27<0.7,\quad 0.21<0.7,\quad 0.10<0.7,\quad 0.04<0.7.
\end{align*}
Thus the data support the presence of a predictive signal somewhere in the correlated block, while the individual covariate identity is not stable under subsampling.
[/example]
## Reading Regularisation Paths
After choosing a tuning method, the analyst should inspect how the fitted model changes as $\lambda$ varies. A regularisation path is not just a computational object; it records when variables enter, leave, or change sign, and it reveals whether a selected model sits in a stable region or at a fragile transition.
[definition: Lasso Regularisation Path]
For a grid $\Lambda\subset(0,\infty)$, the Lasso solution correspondence is the set-valued map $\mathcal P:\Lambda\to 2^{\mathbb R^p}$ defined by
\begin{align*}
\mathcal P(\lambda):=\operatorname*{argmin}_{\beta\in\mathbb R^p}\left\{\frac{\|Y-X\beta\|_2^2}{2n}+\lambda\|\beta\|_1\right\}.
\end{align*}
A chosen Lasso regularisation path is a map $\hat\beta:\Lambda\to\mathbb R^p$ such that $\hat\beta(\lambda)\in\mathcal P(\lambda)$ for every $\lambda\in\Lambda$.
[/definition]
The path is governed by KKT conditions. To interpret entries and exits of variables, we need the coordinate description of residual correlations along the path.
[quotetheorem:5589]
[citeproof:5589]
This description explains the usual path plot: an inactive variable becomes eligible to enter when its residual correlation touches $\lambda$ in magnitude. Convexity is essential because the subgradient condition is then sufficient as well as necessary; for a nonconvex penalty, the same coordinate equations would not rule out non-optimal stationary points. The assumption $\lambda>0$ keeps the $\ell^1$ subdifferential active and separates the active and inactive coordinate conditions; at $\lambda=0$ the problem becomes ordinary least squares, possibly with many minimisers when $p>n$. The KKT description does not guarantee support recovery or uniqueness of the path, because those require additional design assumptions such as restricted eigenvalues or general-position conditions. If the design is highly correlated, variables can enter and later leave because the active residual correlations are coupled through the Gram matrix.
[example: Simulated Sparse Lasso Path]
Generate $Y=X\beta^*+\varepsilon$ with $n=80$, $p=200$, support $S^*=\{1,2,3,4,5\}$, Gaussian noise, and standardised columns satisfying $\|X_j\|_2^2/n=1$. Along a decreasing grid $\Lambda$, write $\hat r(\lambda)=Y-X\hat\beta(\lambda)$ and $\hat S(\lambda)=\{j:\hat\beta_j(\lambda)\neq 0\}$. At the all-zero fit the residual is $Y$, so the initial coordinate score of variable $j$ is
\begin{align*}
\frac{X_j^\top Y}{n}=\frac{X_j^\top(X\beta^*+\varepsilon)}{n}.
\end{align*}
By distributivity,
\begin{align*}
\frac{X_j^\top(X\beta^*+\varepsilon)}{n}=\frac{X_j^\top X\beta^*}{n}+\frac{X_j^\top\varepsilon}{n}.
\end{align*}
Since $X\beta^*=\sum_{k=1}^p X_k\beta_k^*$, the signal term is
\begin{align*}
\frac{X_j^\top X\beta^*}{n}=\sum_{k=1}^{p}\frac{X_j^\top X_k}{n}\beta_k^*.
\end{align*}
Because $\beta_k^*=0$ for $k\notin S^*$, this gives
\begin{align*}
\frac{X_j^\top Y}{n}=\sum_{k\in S^*}\frac{X_j^\top X_k}{n}\beta_k^*+\frac{X_j^\top\varepsilon}{n}.
\end{align*}
If $j\in S^*$ and the correlations with the other active columns are small, the $k=j$ contribution is
\begin{align*}
\frac{X_j^\top X_j}{n}\beta_j^*=1\cdot\beta_j^*=\beta_j^*.
\end{align*}
Thus a large $|\beta_j^*|$ produces a large initial score unless it is offset by correlations with other active columns or by the noise score. If $j\notin S^*$ but $X_j$ is correlated with an active column $X_a$, then the same expansion can be written as
\begin{align*}
\frac{X_j^\top Y}{n}=\frac{X_j^\top X_a}{n}\beta_a^*+\sum_{k\in S^*,\,k\neq a}\frac{X_j^\top X_k}{n}\beta_k^*+\frac{X_j^\top\varepsilon}{n}.
\end{align*}
The first term on the right can be large when $X_j^\top X_a/n$ is large, even though the true coefficient of variable $j$ is zero.
The inactive-coordinate KKT condition along the Lasso path requires
\begin{align*}
\left|\frac{X_j^\top\hat r(\lambda)}{n}\right|\leq \lambda.
\end{align*}
Therefore, as $\lambda$ decreases, an inactive variable becomes eligible to enter when its absolute residual correlation reaches the current penalty level. Plotting the curves $\hat\beta_j(\lambda)$ against $\log\lambda$ separates the early entry of large active coefficients from the later entry of correlated null variables and noise variables. If the cross-validation curve is nearly flat across a range of $\lambda$ values, two penalties can have similar validation error while producing active sets with very different sizes, for example with $|\hat S(\lambda_{\mathrm{large}})|$ much smaller than $|\hat S(\lambda_{\mathrm{small}})|$. The path plot shows whether the cross-validated choice lies in a stable region of the path or near a transition where many weak variables enter.
[/example]
The simulation illustrates why a single selected model can hide path instability. A common practical response is to prefer a larger penalty when the estimated prediction error is nearly tied across a range of values.
[remark: One Standard Error Rule]
In cross-validation, the minimum-CV rule chooses a penalty with the smallest estimated prediction error. The one-standard-error rule instead chooses the largest penalty whose CV score is within one estimated standard error of the minimum. This often gives a sparser and more stable model, although it is still a prediction-driven rule rather than a theorem-level guarantee for support recovery.
[/remark]
The path should be interpreted together with the scale estimate and residual checks. A stable active set at a very small residual variance estimate may indicate overfitting; a stable path with patterned residuals may indicate model misspecification rather than successful sparse recovery.
## Practical Workflow For Sparse Regularisation
The final question is how the pieces fit together in an analysis that respects both the theory and the data. The following workflow separates prediction, estimation, inference, and diagnostics, because each objective places a different demand on the tuning parameter.
[explanation: Recommended Diagnostic Workflow]
Start by standardising columns and recording the normalisation used, since the penalty acts coordinatewise through the scale of $X_j^\top r/n$. Fit a full regularisation path and mark the cross-validation minimum, the one-standard-error choice, and a theory-motivated penalty level when a noise estimate is available. Estimate $\sigma$ using a method consistent with the intended goal, such as square-root or scaled Lasso when the noise level is not known in advance. Inspect residual plots and selection stability before interpreting individual variables. For inference, do not treat the selected Lasso coefficients as ordinary least-squares coefficients unless a valid post-selection or debiased procedure is being applied.
[/explanation]
This workflow reflects the main lesson of the chapter: tuning is not a single universal operation. The same data set can support a small penalty for prediction, a larger penalty for sparse model interpretation, a pivotal penalty for unknown noise, and a stability threshold for reproducible selection.
[remark: What Diagnostics Can And Cannot Certify]
Good residual plots, stable paths, and plausible noise estimates increase confidence that the procedure is behaving as intended. They do not verify restricted eigenvalue conditions, exact sparsity, Gaussian noise, or selection consistency. The formal guarantees still rely on the explicit score, sparsity, and restricted-geometry assumptions developed throughout the course, while diagnostics help decide whether those assumptions are a reasonable approximation in the application at hand.
[/remark]
## Beyond Sparse Linear Models
Sparse regularisation is one entrance to high-dimensional statistics, not the whole subject. The same ideas reappear in graphical models, matrix completion, compressed sensing, high-dimensional covariance estimation, and nonparametric regression with structured penalties. In each case the estimator succeeds only when the statistical signal, the geometry of the design, and the tuning rule are compatible.
This course has emphasised linear models because they make the central mechanisms visible: score bounds control noise, restricted geometry converts prediction error into coefficient error, and diagnostics separate prediction from variable-selection claims. Later topics replace the $\ell^1$ norm by other structures, such as nuclear norms for low-rank matrices or group penalties for block sparsity, but the same question remains: what low-complexity structure is being imposed, and what evidence says that the data support it?
## References
- Internal Androma references:
- [Regression and Linear Models](/page/Regression%20and%20Linear%20Models), for the classical low-dimensional linear-model background behind least squares, prediction error, and residual diagnostics.
- [Multivariate Statistical Analysis](/page/Multivariate%20Statistical%20Analysis), for covariance, concentration, and matrix viewpoints that support high-dimensional inference.
- P. Buhlmann and S. van de Geer, *Statistics for High-Dimensional Data*.
- M. J. Wainwright, *High-Dimensional Statistics: A Non-Asymptotic Viewpoint*.
- T. Hastie, R. Tibshirani, and M. Wainwright, *Statistical Learning with Sparsity*.
- S. van de Geer, *Estimation and Testing Under Sparsity*.
Contents
- Introduction
- Why High Dimension Changes Regression
- Sparsity as a Statistical Assumption
- Regularisation and the Lasso
- Probability Enters Through Score Concentration
- Geometry, Oracle Bounds, and Inference
- 1. High-Dimensional Linear Models and Sparse Structure
- Linear Models in High Dimension
- Sparse Structure and Effective Dimension
- Scaling, Normalisation, and Noise
- 2. The Lasso: Formulations, KKT Conditions, and Geometry
- Penalised and Constrained Formulations
- KKT Conditions and Residual Correlations
- The Basic Inequality for Penalised Empirical Risk
- Soft-Thresholding in the Orthonormal Design Case
- Geometry with Correlated Variables
- 3. Cone Conditions and Restricted Eigenvalues
- The Lasso Cone from the Basic Inequality
- Restricted Eigenvalue and Compatibility Conditions
- Sparse Eigenvalues and Restricted Strong Convexity
- Why Ordinary Eigenvalues Fail in High Dimension
- 4. Oracle Inequalities for Lasso
- Slow-Rate Prediction Without Restricted Eigenvalues
- Fast-Rate Bounds from Sparse Geometry
- Oracle Interpretation
- 5. Variable Selection and Sign Recovery
- Model Selection and Prediction Ask Different Questions
- The Irrepresentable Condition and the Beta-Min Barrier
- Thresholding the Lasso
- False Positives, False Negatives, and Exact Recovery Limits
- 6. Elastic Net, Adaptive Lasso, and Structured Penalties
- Elastic Net and Collinearity
- Adaptive Lasso and Weighted Coordinate Penalties
- Group Lasso and Sparse-Group Penalties
- Connections and Tradeoffs
- 7. Dantzig Selector and Constrained Estimation
- The Score Constraint Behind the Dantzig Selector
- Cone Geometry and Deterministic Error Bounds
- Relationship with the Lasso
- Choosing and Interpreting the Constraint Radius
- 8. Nonconvex Regularisation: SCAD and MCP
- Why Folded-Concave Penalties Reduce Shrinkage Bias
- Local Minima and Coordinate Descent
- Oracle Properties and Sparse Local Optima
- 9. Debiased Lasso and High-Dimensional Inference
- Why Naive Lasso Confidence Intervals Fail
- Nodewise Regression and Approximate Inverse Gram Matrices
- Debiased Estimator and Its Expansion
- Asymptotic Normality and Confidence Intervals
- Confidence Intervals After Variable Selection
- 10. Tuning, Noise Estimation, and Practical Diagnostics
- Choosing the Penalty Level
- Estimating Noise and Checking Residuals
- Reading Regularisation Paths
- Practical Workflow For Sparse Regularisation
- Beyond Sparse Linear Models
- References
High-Dimensional Statistics I: Sparsity and Regularisation
Content
Problems
History
Created by admin on 6/6/2026 | Last updated on 6/6/2026
Prerequisites
No prerequisites required for this page.
Rate this page
★
★
★
★
★
Poor
Excellent