[proofplan]
We write the lasso objective as the sum of a differentiable quadratic term and a separable absolute-value penalty. First we compute the exact change in the quadratic term under an arbitrary perturbation and record the supporting-line inequality for the absolute value. These two facts show that the displayed stationarity condition is sufficient for global minimality. Conversely, if $\hat{\beta}$ is a minimizer, one-dimensional perturbations in each coordinate force the corresponding residual correlation to equal $\lambda \operatorname{sgn}(\hat{\beta}_j)$ on the active coordinates and to lie in $[-\lambda,\lambda]$ on the inactive coordinates.
[/proofplan]
custom_env
admin
[step:Compute the change in the quadratic loss under a perturbation]Define the residual vector $r \in \mathbb{R}^n$ by
\begin{align*}
r := y - X\hat{\beta}.
\end{align*}
For an arbitrary perturbation $h \in \mathbb{R}^p$, we have
\begin{align*}
y - X(\hat{\beta}+h) = r - Xh.
\end{align*}
Therefore
\begin{align*}
\frac{1}{2}|y - X(\hat{\beta}+h)|^2 - \frac{1}{2}|y - X\hat{\beta}|^2
&= \frac{1}{2}|r - Xh|^2 - \frac{1}{2}|r|^2\\
&= -h^\top X^\top r + \frac{1}{2}|Xh|^2.
\end{align*}
This identity will be used with $h = \beta - \hat{\beta}$.[/step]
custom_env
admin
[guided]The quadratic part of the lasso objective is smooth, so its first-order term should involve the gradient. We compute the exact finite difference rather than only the derivative, because the remaining second-order term is nonnegative and will prove global minimality.
Define the residual vector $r \in \mathbb{R}^n$ by
\begin{align*}
r := y - X\hat{\beta}.
\end{align*}
For any perturbation $h \in \mathbb{R}^p$, the residual at $\hat{\beta}+h$ is
\begin{align*}
y - X(\hat{\beta}+h) = y - X\hat{\beta} - Xh = r - Xh.
\end{align*}
Expanding the Euclidean square gives
\begin{align*}
\frac{1}{2}|y - X(\hat{\beta}+h)|^2 - \frac{1}{2}|y - X\hat{\beta}|^2
&= \frac{1}{2}|r - Xh|^2 - \frac{1}{2}|r|^2\\
&= \frac{1}{2}\left(|r|^2 - 2r^\top Xh + |Xh|^2\right) - \frac{1}{2}|r|^2\\
&= -h^\top X^\top r + \frac{1}{2}|Xh|^2.
\end{align*}
The linear term is $-h^\top X^\top r$, and the quadratic remainder $\frac{1}{2}|Xh|^2$ is nonnegative.[/guided]
custom_env
admin
[step:Use the absolute-value subgradient inequality coordinate by coordinate]
Let $t \in \mathbb{R}$ and let $s \in \partial |t|$. Then for every $u \in \mathbb{R}$,
\begin{align*}
|u| \geq |t| + s(u-t).
\end{align*}
Indeed, if $t>0$, then $s=1$ and $|u| \geq u = |t| + u - t$. If $t<0$, then $s=-1$ and $|u| \geq -u = |t| - (u-t)$. If $t=0$, then $s \in [-1,1]$, so $su \leq |s||u| \leq |u|$, which gives $|u| \geq |0| + s(u-0)$.
Consequently, if $z_j \in \partial |\hat{\beta}_j|$ for every $j \in \{1,\dots,p\}$, then for every $\beta \in \mathbb{R}^p$,
\begin{align*}
\sum_{j=1}^p |\beta_j|
\geq \sum_{j=1}^p |\hat{\beta}_j| + \sum_{j=1}^p z_j(\beta_j-\hat{\beta}_j).
\end{align*}
[/step]
custom_env
admin
[step:Prove that the KKT condition implies global minimality]Assume that there exists $z \in \mathbb{R}^p$ such that $z_j \in \partial |\hat{\beta}_j|$ for every $j \in \{1,\dots,p\}$ and
\begin{align*}
X^\top r = \lambda z.
\end{align*}
Let $\beta \in \mathbb{R}^p$ be arbitrary, and define $h \in \mathbb{R}^p$ by
\begin{align*}
h := \beta - \hat{\beta}.
\end{align*}
Using the quadratic identity and the coordinate subgradient inequality,
\begin{align*}
F(\beta)-F(\hat{\beta})
&\geq -h^\top X^\top r + \frac{1}{2}|Xh|^2 + \lambda z^\top h\\
&= -h^\top \lambda z + \frac{1}{2}|Xh|^2 + \lambda z^\top h\\
&= \frac{1}{2}|Xh|^2\\
&\geq 0.
\end{align*}
Thus $F(\beta) \geq F(\hat{\beta})$ for every $\beta \in \mathbb{R}^p$, so $\hat{\beta}$ minimizes $F$.[/step]
custom_env
admin
[guided]Assume the displayed KKT condition holds. That means there is a vector $z \in \mathbb{R}^p$ such that each coordinate $z_j$ is a valid subgradient of the absolute value at $\hat{\beta}_j$, and the residual correlation vector satisfies
\begin{align*}
X^\top r = \lambda z.
\end{align*}
To prove global minimality, take an arbitrary competitor $\beta \in \mathbb{R}^p$ and define the perturbation $h \in \mathbb{R}^p$ by
\begin{align*}
h := \beta - \hat{\beta}.
\end{align*}
The quadratic part changes by
\begin{align*}
\frac{1}{2}|y - X\beta|^2 - \frac{1}{2}|y - X\hat{\beta}|^2
= -h^\top X^\top r + \frac{1}{2}|Xh|^2.
\end{align*}
The penalty part is controlled by the supporting-line inequality for $|\cdot|$ in each coordinate:
\begin{align*}
\sum_{j=1}^p |\beta_j| - \sum_{j=1}^p |\hat{\beta}_j|
\geq \sum_{j=1}^p z_j(\beta_j-\hat{\beta}_j)
= z^\top h.
\end{align*}
Adding the two estimates gives
\begin{align*}
F(\beta)-F(\hat{\beta})
&\geq -h^\top X^\top r + \frac{1}{2}|Xh|^2 + \lambda z^\top h.
\end{align*}
Now substitute the stationarity condition $X^\top r=\lambda z$:
\begin{align*}
F(\beta)-F(\hat{\beta})
&\geq -h^\top \lambda z + \frac{1}{2}|Xh|^2 + \lambda z^\top h\\
&= \frac{1}{2}|Xh|^2\\
&\geq 0.
\end{align*}
Because $\beta$ was arbitrary, every competitor has objective value at least $F(\hat{\beta})$. Hence $\hat{\beta}$ is a global minimizer.[/guided]
custom_env
admin
[step:Derive the coordinate conditions from one-dimensional perturbations]Assume that $\hat{\beta}$ minimizes $F$. Fix $j \in \{1,\dots,p\}$, and let $e_j \in \mathbb{R}^p$ denote the $j$-th standard basis vector. Define the scalar residual correlation
\begin{align*}
a_j := x_j^\top r.
\end{align*}
For every $t \in \mathbb{R}$, minimality of $\hat{\beta}$ applied to $\hat{\beta}+t e_j$ gives
\begin{align*}
0
&\leq F(\hat{\beta}+t e_j)-F(\hat{\beta})\\
&= -t a_j + \frac{1}{2}t^2|x_j|^2
+ \lambda\left(|\hat{\beta}_j+t|-|\hat{\beta}_j|\right).
\end{align*}
If $\hat{\beta}_j>0$, then for all sufficiently small $t$ we have $\hat{\beta}_j+t>0$, and hence
\begin{align*}
0 \leq -t a_j + \frac{1}{2}t^2|x_j|^2 + \lambda t.
\end{align*}
Dividing by $t>0$ and letting $t \downarrow 0$ gives $a_j \leq \lambda$; dividing by $t<0$ and letting $t \uparrow 0$ gives $a_j \geq \lambda$. Thus $a_j=\lambda=\lambda\operatorname{sgn}(\hat{\beta}_j)$.
If $\hat{\beta}_j<0$, then for all sufficiently small $t$ we have $\hat{\beta}_j+t<0$, and hence
\begin{align*}
0 \leq -t a_j + \frac{1}{2}t^2|x_j|^2 - \lambda t.
\end{align*}
Dividing by $t>0$ and letting $t \downarrow 0$ gives $a_j \leq -\lambda$; dividing by $t<0$ and letting $t \uparrow 0$ gives $a_j \geq -\lambda$. Thus $a_j=-\lambda=\lambda\operatorname{sgn}(\hat{\beta}_j)$.
If $\hat{\beta}_j=0$, then
\begin{align*}
0 \leq -t a_j + \frac{1}{2}t^2|x_j|^2 + \lambda |t|.
\end{align*}
Dividing by $t>0$ and letting $t \downarrow 0$ gives $a_j \leq \lambda$. Dividing by $t<0$ and letting $t \uparrow 0$ gives $a_j \geq -\lambda$. Hence $|a_j|\leq \lambda$.
Therefore, for every $j \in \{1,\dots,p\}$,
\begin{align*}
x_j^\top(y-X\hat{\beta}) &= \lambda \operatorname{sgn}(\hat{\beta}_j) \qquad \text{if } \hat{\beta}_j \neq 0,\\
|x_j^\top(y-X\hat{\beta})| &\leq \lambda \qquad \text{if } \hat{\beta}_j = 0.
\end{align*}[/step]
custom_env
admin
[guided]Now assume $\hat{\beta}$ is already known to minimize $F$. We test minimality in only one coordinate at a time. Fix $j \in \{1,\dots,p\}$, and let $e_j \in \mathbb{R}^p$ be the $j$-th standard basis vector. Define the scalar residual correlation
\begin{align*}
a_j := x_j^\top r.
\end{align*}
For every scalar perturbation $t \in \mathbb{R}$, the competitor $\hat{\beta}+t e_j$ is admissible. Since $\hat{\beta}$ minimizes $F$,
\begin{align*}
0
&\leq F(\hat{\beta}+t e_j)-F(\hat{\beta})\\
&= -t a_j + \frac{1}{2}t^2|x_j|^2
+ \lambda\left(|\hat{\beta}_j+t|-|\hat{\beta}_j|\right).
\end{align*}
This one-dimensional inequality contains the entire coordinate optimality condition.
Suppose first that $\hat{\beta}_j>0$. For all sufficiently small $t$, the number $\hat{\beta}_j+t$ remains positive, so
\begin{align*}
|\hat{\beta}_j+t|-|\hat{\beta}_j| = t.
\end{align*}
Thus
\begin{align*}
0 \leq -t a_j + \frac{1}{2}t^2|x_j|^2 + \lambda t.
\end{align*}
For $t>0$, divide by $t$ and let $t \downarrow 0$:
\begin{align*}
0 \leq -a_j+\lambda,
\end{align*}
so $a_j \leq \lambda$. For $t<0$, dividing reverses the inequality after division by a negative number:
\begin{align*}
0 \geq -a_j+\frac{1}{2}t|x_j|^2+\lambda.
\end{align*}
Letting $t \uparrow 0$ gives $a_j \geq \lambda$. Therefore $a_j=\lambda$.
Suppose next that $\hat{\beta}_j<0$. For all sufficiently small $t$, the number $\hat{\beta}_j+t$ remains negative, so
\begin{align*}
|\hat{\beta}_j+t|-|\hat{\beta}_j| = -t.
\end{align*}
The variational inequality becomes
\begin{align*}
0 \leq -t a_j + \frac{1}{2}t^2|x_j|^2 - \lambda t.
\end{align*}
Dividing by $t>0$ and letting $t \downarrow 0$ gives $a_j \leq -\lambda$. Dividing by $t<0$ and letting $t \uparrow 0$ gives $a_j \geq -\lambda$. Hence $a_j=-\lambda$.
Finally suppose that $\hat{\beta}_j=0$. Then
\begin{align*}
|\hat{\beta}_j+t|-|\hat{\beta}_j| = |t|,
\end{align*}
so
\begin{align*}
0 \leq -t a_j + \frac{1}{2}t^2|x_j|^2 + \lambda |t|.
\end{align*}
For $t>0$, divide by $t$ and let $t \downarrow 0$ to obtain $a_j \leq \lambda$. For $t<0$, divide by $t$ and let $t \uparrow 0$ to obtain $a_j \geq -\lambda$. Therefore $|a_j|\leq \lambda$.
Since $a_j=x_j^\top(y-X\hat{\beta})$, this proves the stated coordinate conditions.[/guided]
custom_env
admin
[step:Package the coordinate conditions as the subgradient vector condition]
Assume the coordinate conditions hold. Define $z \in \mathbb{R}^p$ coordinatewise by choosing
\begin{align*}
z_j :=
\begin{cases}
\operatorname{sgn}(\hat{\beta}_j), & \hat{\beta}_j \neq 0,\\
\lambda^{-1}x_j^\top(y-X\hat{\beta}), & \hat{\beta}_j = 0 \text{ and } \lambda>0,\\
0, & \hat{\beta}_j = 0 \text{ and } \lambda=0.
\end{cases}
\end{align*}
If $\hat{\beta}_j \neq 0$, then $z_j \in \partial |\hat{\beta}_j|$ by the definition of $\partial |\cdot|$. If $\hat{\beta}_j=0$ and $\lambda>0$, then $|x_j^\top(y-X\hat{\beta})|\leq \lambda$ gives $z_j \in [-1,1]=\partial |0|$. If $\hat{\beta}_j=0$ and $\lambda=0$, then $z_j=0 \in [-1,1]=\partial |0|$.
For $\hat{\beta}_j \neq 0$, the coordinate condition gives
\begin{align*}
x_j^\top(y-X\hat{\beta}) = \lambda z_j.
\end{align*}
For $\hat{\beta}_j=0$ and $\lambda>0$, this equality holds by the definition of $z_j$. For $\hat{\beta}_j=0$ and $\lambda=0$, the coordinate inequality gives $|x_j^\top(y-X\hat{\beta})|\leq 0$, hence $x_j^\top(y-X\hat{\beta})=0=\lambda z_j$. Therefore
\begin{align*}
X^\top(y-X\hat{\beta})=\lambda z,
\end{align*}
and $z_j\in \partial|\hat{\beta}_j|$ for every $j$.
Combining this construction with the sufficiency already proved shows that the subgradient vector condition, the coordinate conditions, and global minimality of $\hat{\beta}$ are equivalent.
[/step]