[proofplan]
The first direction uses only a comparison of objective values: penalized optimality is tested against vectors whose $\ell^1$ norm is no larger than $\|\hat{\beta}_\lambda\|_1$. The penalty terms then order in the correct direction, forcing the empirical squared-error loss at $\hat{\beta}_\lambda$ to be minimal on the corresponding $\ell^1$ ball. For the converse, the stationarity condition says exactly that $0$ belongs to the subdifferential of the penalized objective at $\hat{\beta}_t$; the convex subgradient inequality then proves global optimality. The feasibility and [complementary slackness](/theorems/2559) assumptions are used only to recognize the constrained KKT system, so the penalized conclusion follows as an application of the weaker stationarity argument.
[/proofplan]
custom_env
admin
[step:Compare the penalized objective on the constrained feasible set]Let $n,p \in \mathbb{N}$, let $X \in \mathbb{R}^{n \times p}$ be the design matrix, and let $Y \in \mathbb{R}^n$ be the response vector. For each radius $r \geq 0$, let
\begin{align*}
C_r := \{\gamma \in \mathbb{R}^p : \|\gamma\|_1 \leq r\}
\end{align*}
denote the closed $\ell^1$ ball of radius $r$. Define the empirical squared-error loss $L_n: \mathbb{R}^p \to \mathbb{R}$ by
\begin{align*}
L_n(\beta) := \frac{1}{2n}|Y - X\beta|^2
\end{align*}
for each $\beta \in \mathbb{R}^p$. For each $\lambda \geq 0$, define the penalized Lasso objective $P_\lambda: \mathbb{R}^p \to \mathbb{R}$ by
\begin{align*}
P_\lambda(\beta) := L_n(\beta) + \lambda\|\beta\|_1
\end{align*}
for each $\beta \in \mathbb{R}^p$. Fix $\lambda \geq 0$, and let $\hat{\beta}_\lambda \in \mathbb{R}^p$ minimize $P_\lambda$ over $\mathbb{R}^p$. Define
\begin{align*}
t := \|\hat{\beta}_\lambda\|_1 .
\end{align*}
Let $\beta \in C_t$ be arbitrary. By the definition of $C_t$,
\begin{align*}
\|\beta\|_1 \leq t = \|\hat{\beta}_\lambda\|_1 .
\end{align*}
Since $\hat{\beta}_\lambda$ minimizes $P_\lambda$ over $\mathbb{R}^p$, we have
\begin{align*}
L_n(\hat{\beta}_\lambda) + \lambda \|\hat{\beta}_\lambda\|_1
\leq
L_n(\beta) + \lambda \|\beta\|_1 .
\end{align*}
Using $\lambda \geq 0$ and $\|\beta\|_1 \leq \|\hat{\beta}_\lambda\|_1$, the right-hand side satisfies
\begin{align*}
L_n(\beta) + \lambda \|\beta\|_1
\leq
L_n(\beta) + \lambda \|\hat{\beta}_\lambda\|_1 .
\end{align*}
Therefore
\begin{align*}
L_n(\hat{\beta}_\lambda) + \lambda \|\hat{\beta}_\lambda\|_1
\leq
L_n(\beta) + \lambda \|\hat{\beta}_\lambda\|_1 .
\end{align*}
Subtracting the finite scalar $\lambda \|\hat{\beta}_\lambda\|_1$ from both sides gives
\begin{align*}
L_n(\hat{\beta}_\lambda) \leq L_n(\beta).
\end{align*}
Since $\beta \in C_t$ was arbitrary, $\hat{\beta}_\lambda$ minimizes $L_n$ over $C_t$.[/step]
custom_env
admin
[guided]The key point is that the constrained radius is chosen after solving the penalized problem. We work with $n,p \in \mathbb{N}$, a design matrix $X \in \mathbb{R}^{n \times p}$, and a response vector $Y \in \mathbb{R}^n$. For each $r \geq 0$, the constrained feasible set is
\begin{align*}
C_r := \{\gamma \in \mathbb{R}^p : \|\gamma\|_1 \leq r\}.
\end{align*}
The empirical squared-error loss is the map $L_n: \mathbb{R}^p \to \mathbb{R}$ defined by
\begin{align*}
L_n(\beta) := \frac{1}{2n}|Y - X\beta|^2
\end{align*}
for each $\beta \in \mathbb{R}^p$, and the penalized objective is the map $P_\lambda: \mathbb{R}^p \to \mathbb{R}$ defined by
\begin{align*}
P_\lambda(\beta) := L_n(\beta) + \lambda\|\beta\|_1
\end{align*}
for each $\beta \in \mathbb{R}^p$. We set
\begin{align*}
t := \|\hat{\beta}_\lambda\|_1,
\end{align*}
so every feasible vector $\beta \in C_t$ satisfies, by the definition of $C_t$,
\begin{align*}
\|\beta\|_1 \leq \|\hat{\beta}_\lambda\|_1 .
\end{align*}
Now compare $\hat{\beta}_\lambda$ with such a feasible vector $\beta$. Penalized optimality gives
\begin{align*}
L_n(\hat{\beta}_\lambda) + \lambda \|\hat{\beta}_\lambda\|_1
\leq
L_n(\beta) + \lambda \|\beta\|_1 .
\end{align*}
Because $\lambda \geq 0$, multiplying the inequality
\begin{align*}
\|\beta\|_1 \leq \|\hat{\beta}_\lambda\|_1
\end{align*}
by $\lambda$ preserves the order:
\begin{align*}
\lambda \|\beta\|_1 \leq \lambda \|\hat{\beta}_\lambda\|_1 .
\end{align*}
Hence
\begin{align*}
L_n(\beta) + \lambda \|\beta\|_1
\leq
L_n(\beta) + \lambda \|\hat{\beta}_\lambda\|_1 .
\end{align*}
Combining the two inequalities yields
\begin{align*}
L_n(\hat{\beta}_\lambda) + \lambda \|\hat{\beta}_\lambda\|_1
\leq
L_n(\beta) + \lambda \|\hat{\beta}_\lambda\|_1 .
\end{align*}
The same finite penalty term appears on both sides, so subtracting it gives
\begin{align*}
L_n(\hat{\beta}_\lambda) \leq L_n(\beta).
\end{align*}
This holds for every $\beta \in C_t$, so $\hat{\beta}_\lambda$ is a constrained Lasso solution at radius $t$.[/guided]
custom_env
admin
[step:Use the KKT stationarity condition as a penalized subgradient condition]Let $\hat{\beta}_t \in \mathbb{R}^p$, and suppose there exist $\lambda \geq 0$ and $z \in \partial \|\hat{\beta}_t\|_1$ satisfying
\begin{align*}
-\frac{1}{n}X^\top(Y - X\hat{\beta}_t) + \lambda z = 0.
\end{align*}
The loss $L_n$ is differentiable. Its gradient is the map $\nabla L_n: \mathbb{R}^p \to \mathbb{R}^p$ given by
\begin{align*}
\nabla L_n(\beta) = -\frac{1}{n}X^\top(Y - X\beta)
\end{align*}
for each $\beta \in \mathbb{R}^p$. Define the constant second-derivative matrix $H_n \in \mathbb{R}^{p \times p}$ by
\begin{align*}
H_n := \frac{1}{n}X^\top X.
\end{align*}
For every $v \in \mathbb{R}^p$,
\begin{align*}
v^\top H_n v = \frac{1}{n}|Xv|^2 \geq 0,
\end{align*}
so $H_n$ is positive semidefinite and $L_n$ is a differentiable convex function. Thus the displayed stationarity condition is
\begin{align*}
\nabla L_n(\hat{\beta}_t) + \lambda z = 0.
\end{align*}
Let $\beta \in \mathbb{R}^p$ be arbitrary. Define the auxiliary one-dimensional map $\phi_\beta: [0,1] \to \mathbb{R}$ by
\begin{align*}
\phi_\beta(s) := L_n(\hat{\beta}_t + s(\beta - \hat{\beta}_t))
\end{align*}
for each $s \in [0,1]$. Since $L_n$ is a differentiable [convex function](/page/Convex%20Function), $\phi_\beta$ is differentiable and convex. For $0 < s \leq 1$, convexity gives
\begin{align*}
\phi_\beta(s) \leq (1-s)\phi_\beta(0) + s\phi_\beta(1),
\end{align*}
so
\begin{align*}
\frac{\phi_\beta(s)-\phi_\beta(0)}{s} \leq \phi_\beta(1)-\phi_\beta(0).
\end{align*}
Letting $s \downarrow 0$ and using differentiability of $\phi_\beta$ gives
\begin{align*}
\nabla L_n(\hat{\beta}_t)^\top(\beta - \hat{\beta}_t) \leq L_n(\beta)-L_n(\hat{\beta}_t),
\end{align*}
which is equivalently
\begin{align*}
L_n(\beta)
\geq
L_n(\hat{\beta}_t) + \nabla L_n(\hat{\beta}_t)^\top(\beta - \hat{\beta}_t).
\end{align*}
The $\ell^1$ norm $\|\cdot\|_1: \mathbb{R}^p \to \mathbb{R}$ is convex. Since $z \in \partial \|\hat{\beta}_t\|_1$, the defining inequality for a [subgradient](/page/Subgradient) of this convex function at $\hat{\beta}_t$ gives
\begin{align*}
\|\beta\|_1
\geq
\|\hat{\beta}_t\|_1 + z^\top(\beta - \hat{\beta}_t).
\end{align*}
Multiplying the [second inequality](/theorems/2899) by $\lambda \geq 0$ and adding it to the first gives
\begin{align*}
L_n(\beta) + \lambda \|\beta\|_1
\geq
L_n(\hat{\beta}_t) + \lambda \|\hat{\beta}_t\|_1
+
\left(\nabla L_n(\hat{\beta}_t) + \lambda z\right)^\top(\beta - \hat{\beta}_t).
\end{align*}
Using $\nabla L_n(\hat{\beta}_t) + \lambda z = 0$, this becomes
\begin{align*}
L_n(\beta) + \lambda \|\beta\|_1
\geq
L_n(\hat{\beta}_t) + \lambda \|\hat{\beta}_t\|_1 .
\end{align*}
Therefore
\begin{align*}
P_\lambda(\beta) \geq P_\lambda(\hat{\beta}_t).
\end{align*}
Since $\beta \in \mathbb{R}^p$ was arbitrary, $\hat{\beta}_t$ minimizes $P_\lambda$ over $\mathbb{R}^p$.[/step]
custom_env
admin
[guided]The converse is a convex optimality argument. We start with $\hat{\beta}_t \in \mathbb{R}^p$ and assume that there are $\lambda \geq 0$ and $z \in \partial\|\hat{\beta}_t\|_1$ such that
\begin{align*}
-\frac{1}{n}X^\top(Y - X\hat{\beta}_t) + \lambda z = 0.
\end{align*}
The loss $L_n: \mathbb{R}^p \to \mathbb{R}$ is differentiable, and its gradient is
\begin{align*}
\nabla L_n(\beta) = -\frac{1}{n}X^\top(Y - X\beta)
\end{align*}
for each $\beta \in \mathbb{R}^p$. To verify convexity rather than assume it, define
\begin{align*}
H_n := \frac{1}{n}X^\top X.
\end{align*}
For every $v \in \mathbb{R}^p$,
\begin{align*}
v^\top H_n v = \frac{1}{n}|Xv|^2 \geq 0,
\end{align*}
so the Hessian matrix is positive semidefinite and $L_n$ is convex. Thus the stationarity condition can be written as
\begin{align*}
\nabla L_n(\hat{\beta}_t) + \lambda z = 0.
\end{align*}
Now fix an arbitrary comparison vector $\beta \in \mathbb{R}^p$. We need to prove that the penalized objective at $\beta$ is no smaller than the penalized objective at $\hat{\beta}_t$. Define $\phi_\beta: [0,1] \to \mathbb{R}$ by
\begin{align*}
\phi_\beta(s) := L_n(\hat{\beta}_t + s(\beta - \hat{\beta}_t)).
\end{align*}
Because $L_n$ is differentiable and convex, $\phi_\beta$ is differentiable and convex. For $0 < s \leq 1$, convexity gives
\begin{align*}
\phi_\beta(s) \leq (1-s)\phi_\beta(0) + s\phi_\beta(1),
\end{align*}
and therefore
\begin{align*}
\frac{\phi_\beta(s)-\phi_\beta(0)}{s} \leq \phi_\beta(1)-\phi_\beta(0).
\end{align*}
Letting $s \downarrow 0$ gives
\begin{align*}
\nabla L_n(\hat{\beta}_t)^\top(\beta - \hat{\beta}_t) \leq L_n(\beta)-L_n(\hat{\beta}_t),
\end{align*}
so
\begin{align*}
L_n(\beta)
\geq
L_n(\hat{\beta}_t) + \nabla L_n(\hat{\beta}_t)^\top(\beta - \hat{\beta}_t).
\end{align*}
The second ingredient is the subgradient inequality for the convex map $\|\cdot\|_1: \mathbb{R}^p \to \mathbb{R}$. Since $z \in \partial\|\hat{\beta}_t\|_1$, the definition of [subgradient](/page/Subgradient) gives
\begin{align*}
\|\beta\|_1
\geq
\|\hat{\beta}_t\|_1 + z^\top(\beta - \hat{\beta}_t).
\end{align*}
Multiplying by $\lambda \geq 0$ preserves the inequality. Adding this inequality to the lower bound for $L_n(\beta)$ gives
\begin{align*}
L_n(\beta) + \lambda \|\beta\|_1
\geq
L_n(\hat{\beta}_t) + \lambda \|\hat{\beta}_t\|_1
+
\left(\nabla L_n(\hat{\beta}_t)+\lambda z\right)^\top(\beta - \hat{\beta}_t).
\end{align*}
The final inner-product term is zero by stationarity, hence
\begin{align*}
L_n(\beta) + \lambda \|\beta\|_1
\geq
L_n(\hat{\beta}_t) + \lambda \|\hat{\beta}_t\|_1.
\end{align*}
This is exactly
\begin{align*}
P_\lambda(\beta) \geq P_\lambda(\hat{\beta}_t).
\end{align*}
Since the comparison vector $\beta \in \mathbb{R}^p$ was arbitrary, $\hat{\beta}_t$ minimizes the penalized Lasso objective over all of $\mathbb{R}^p$.[/guided]
custom_env
admin
[step:Interpret complementary slackness as part of the constrained KKT formulation]The preceding step used only the stationarity condition
\begin{align*}
\nabla L_n(\hat{\beta}_t) + \lambda z = 0
\end{align*}
with $z \in \partial\|\hat{\beta}_t\|_1$ and $\lambda \geq 0$ to prove penalized optimality. If a radius $t \geq 0$ is also fixed, the feasibility condition $\|\hat{\beta}_t\|_1 \leq t$ and the complementary slackness condition
\begin{align*}
\lambda(\|\hat{\beta}_t\|_1 - t) = 0
\end{align*}
are therefore not needed for the penalized conclusion itself; they identify the displayed stationarity equation as part of the convex KKT system for the constrained formulation. Consequently, the weaker stationarity and subgradient hypotheses already imply penalized optimality, and the stated constrained KKT assumptions give one important case in which those hypotheses hold. Thus any constrained Lasso solution satisfying these KKT conditions solves the penalized Lasso with parameter $\lambda$.[/step]
custom_env
admin
[guided]The preceding step proves a slightly stronger statement than the theorem needs. It shows that if $\hat{\beta}_t \in \mathbb{R}^p$, $\lambda \geq 0$, and $z \in \partial\|\hat{\beta}_t\|_1$ satisfy
\begin{align*}
\nabla L_n(\hat{\beta}_t) + \lambda z = 0,
\end{align*}
then $\hat{\beta}_t$ minimizes the penalized objective $P_\lambda: \mathbb{R}^p \to \mathbb{R}$ over $\mathbb{R}^p$.
Now return to the constrained problem at radius $t \geq 0$. The feasibility condition is
\begin{align*}
\|\hat{\beta}_t\|_1 \leq t,
\end{align*}
and the complementary slackness condition is
\begin{align*}
\lambda(\|\hat{\beta}_t\|_1 - t) = 0.
\end{align*}
These two conditions identify $\hat{\beta}_t$ and $\lambda$ as part of the convex KKT formulation for the constrained Lasso problem, but the penalized optimality proof has already consumed the only condition it needs: stationarity together with $z \in \partial\|\hat{\beta}_t\|_1$ and $\lambda \geq 0$. Therefore any constrained Lasso solution satisfying the displayed KKT conditions satisfies the hypotheses of the penalized subgradient argument, and hence solves the penalized Lasso with parameter $\lambda$.[/guided]