[proofplan]
We compute the first-order directional condition for the convex Lasso objective. The differentiable squared-error term contributes the gradient $-X^\top r/n$, while the one-dimensional absolute value term contributes either the sign of an active coordinate or an interval of admissible slopes at an inactive coordinate. These coordinate conditions give a vector subgradient containing the origin, which proves necessity; the subgradient inequality for the convex objective proves sufficiency.
[/proofplan]
custom_env
admin
[step:Compute the first variation of the squared-error term]
Fix $\lambda > 0$ and write $\hat{\beta} = \hat{\beta}(\lambda)$, $\hat{r} = \hat{r}(\lambda)$, and $\hat{S} = \hat{S}(\lambda)$. Define the differentiable squared-error function $F: \mathbb{R}^p \to \mathbb{R}$ by
\begin{align*}
F(\beta) = \frac{1}{2n}|Y - X\beta|^2.
\end{align*}
For $h \in \mathbb{R}^p$ and $t \in \mathbb{R}$, expanding the square first gives
\begin{align*}
F(\hat{\beta} + th) = \frac{1}{2n}|\hat{r} - tXh|^2.
\end{align*}
Using bilinearity of the Euclidean [inner product](/page/Inner%20Product) on $\mathbb{R}^n$, this becomes
\begin{align*}
F(\hat{\beta} + th) = \frac{1}{2n}|\hat{r}|^2 - \frac{t}{n}\hat{r}^\top Xh + \frac{t^2}{2n}|Xh|^2.
\end{align*}
Therefore the directional derivative of $F$ at $\hat{\beta}$ in the direction $h$ is
\begin{align*}
\left.\frac{d}{dt}\right|_{t=0} F(\hat{\beta} + th) = -\frac{1}{n}\hat{r}^\top Xh.
\end{align*}
Writing the matrix product coordinatewise gives
\begin{align*}
\left.\frac{d}{dt}\right|_{t=0} F(\hat{\beta} + th) = \sum_{j=1}^p \left(-\frac{X_j^\top \hat{r}}{n}\right)h_j.
\end{align*}
Thus the gradient of $F$ at $\hat{\beta}$ is
\begin{align*}
\nabla F(\hat{\beta}) = -\frac{1}{n}X^\top \hat{r}.
\end{align*}
[/step]
custom_env
admin
[step:Describe the coordinate subgradients of the $\ell^1$ term]For each coordinate $j \in \{1,\dots,p\}$, define the one-dimensional function $\varphi_j: \mathbb{R} \to \mathbb{R}$ by
\begin{align*}
\varphi_j(t) = |t|.
\end{align*}
If $\hat{\beta}_j \neq 0$, then $\varphi_j$ is differentiable at $\hat{\beta}_j$ and
\begin{align*}
\varphi_j'(\hat{\beta}_j) = \operatorname{sgn}(\hat{\beta}_j).
\end{align*}
If $\hat{\beta}_j = 0$, then the subgradient inequality for $\varphi_j$ at $0$ is
\begin{align*}
|t| \geq z_j t \quad \text{for all } t \in \mathbb{R}.
\end{align*}
This holds exactly for $z_j \in [-1,1]$. Hence a vector $z \in \mathbb{R}^p$ is a subgradient of $\|\cdot\|_1$ at $\hat{\beta}$ precisely when
\begin{align*}
z_j = \operatorname{sgn}(\hat{\beta}_j) \quad \text{for } j \in \hat{S},
\end{align*}
and
\begin{align*}
z_j \in [-1,1] \quad \text{for } j \notin \hat{S}.
\end{align*}[/step]
custom_env
admin
[guided]The only nonsmooth part of the Lasso objective is the absolute value at zero, so we reduce the subgradient computation to one coordinate at a time. For each $j \in \{1,\dots,p\}$, define $\varphi_j: \mathbb{R} \to \mathbb{R}$ by
\begin{align*}
\varphi_j(t) = |t|.
\end{align*}
If $\hat{\beta}_j > 0$, then $\varphi_j(t) = t$ for all $t$ in a neighbourhood of $\hat{\beta}_j$, so the derivative is $1$. If $\hat{\beta}_j < 0$, then $\varphi_j(t) = -t$ for all $t$ in a neighbourhood of $\hat{\beta}_j$, so the derivative is $-1$. Combining these two active-coordinate cases gives
\begin{align*}
\varphi_j'(\hat{\beta}_j) = \operatorname{sgn}(\hat{\beta}_j)
\quad \text{whenever } \hat{\beta}_j \neq 0.
\end{align*}
At an inactive coordinate, $\hat{\beta}_j = 0$, the function $\varphi_j$ is not differentiable. We therefore ask which [real numbers](/page/Real%20Numbers) $z_j$ give a supporting affine function at $0$. The required inequality is
\begin{align*}
|t| \geq z_j t \quad \text{for all } t \in \mathbb{R}.
\end{align*}
Taking $t > 0$ gives $t \geq z_j t$, hence $z_j \leq 1$. Taking $t < 0$ gives $-t \geq z_j t$, and dividing by the negative number $t$ reverses the inequality, giving $z_j \geq -1$. Therefore $z_j \in [-1,1]$ is necessary. Conversely, if $z_j \in [-1,1]$, then $z_j t \leq |z_j||t| \leq |t|$ for every $t \in \mathbb{R}$, so the inequality holds.
Since
\begin{align*}
\|\beta\|_1 = \sum_{j=1}^p |\beta_j|,
\end{align*}
the subgradient of $\|\cdot\|_1$ is obtained coordinate by coordinate. Thus $z \in \mathbb{R}^p$ is a subgradient of $\|\cdot\|_1$ at $\hat{\beta}$ precisely when
\begin{align*}
z_j = \operatorname{sgn}(\hat{\beta}_j) \quad \text{for } j \in \hat{S},
\end{align*}
and
\begin{align*}
z_j \in [-1,1] \quad \text{for } j \notin \hat{S}.
\end{align*}[/guided]
custom_env
admin
[step:Derive the active and inactive coordinate conditions from optimality]
Since $\hat{\beta}$ minimizes the convex function $Q_\lambda = F + \lambda\|\cdot\|_1$, there exists a subgradient vector $z \in \mathbb{R}^p$ of $\|\cdot\|_1$ at $\hat{\beta}$ such that
\begin{align*}
0 = \nabla F(\hat{\beta}) + \lambda z.
\end{align*}
Using $\nabla F(\hat{\beta}) = -X^\top \hat{r}/n$, this becomes
\begin{align*}
\frac{X^\top \hat{r}}{n} = \lambda z.
\end{align*}
Taking the $j$th coordinate gives
\begin{align*}
\frac{X_j^\top \hat{r}}{n} = \lambda z_j.
\end{align*}
If $j \in \hat{S}$, then $z_j = \operatorname{sgn}(\hat{\beta}_j)$, so
\begin{align*}
\frac{X_j^\top \hat{r}}{n} = \lambda \operatorname{sgn}(\hat{\beta}_j).
\end{align*}
If $j \notin \hat{S}$, then $z_j \in [-1,1]$, so
\begin{align*}
\left|\frac{X_j^\top \hat{r}}{n}\right|
= \lambda |z_j|
\leq \lambda.
\end{align*}
These are the asserted active and inactive coordinate conditions.
[/step]
custom_env
admin
[step:Prove that the coordinate conditions imply optimality]
Let $\beta \in \mathbb{R}^p$ and $r = Y - X\beta$. Assume that
\begin{align*}
\frac{X_j^\top r}{n} = \lambda \operatorname{sgn}(\beta_j)
\quad \text{for every } j \text{ with } \beta_j \neq 0,
\end{align*}
and
\begin{align*}
\left|\frac{X_j^\top r}{n}\right| \leq \lambda
\quad \text{for every } j \text{ with } \beta_j = 0.
\end{align*}
Define $z \in \mathbb{R}^p$ coordinatewise as follows. For each $j \in \{1,\dots,p\}$ with $\beta_j \neq 0$, set
\begin{align*}
z_j = \operatorname{sgn}(\beta_j).
\end{align*}
For each $j \in \{1,\dots,p\}$ with $\beta_j = 0$, set
\begin{align*}
z_j = \frac{X_j^\top r}{n\lambda}.
\end{align*}
The assumed inactive-coordinate inequality gives $z_j \in [-1,1]$ whenever $\beta_j = 0$, so $z$ is a subgradient of $\|\cdot\|_1$ at $\beta$. The active and inactive definitions also give
\begin{align*}
\frac{X^\top r}{n} = \lambda z,
\end{align*}
hence
\begin{align*}
0 = -\frac{X^\top r}{n} + \lambda z = \nabla F(\beta) + \lambda z.
\end{align*}
For any $\gamma \in \mathbb{R}^p$, convexity of $F$ gives
\begin{align*}
F(\gamma) \geq F(\beta) + \nabla F(\beta)^\top(\gamma-\beta).
\end{align*}
The subgradient inequality for $\|\cdot\|_1$ gives
\begin{align*}
\|\gamma\|_1 \geq \|\beta\|_1 + z^\top(\gamma-\beta).
\end{align*}
Multiplying the [second inequality](/theorems/2899) by $\lambda$ and adding yields
\begin{align*}
Q_\lambda(\gamma) \geq Q_\lambda(\beta) + \left(\nabla F(\beta) + \lambda z\right)^\top(\gamma-\beta).
\end{align*}
Since $\nabla F(\beta) + \lambda z = 0$, the preceding inequality reduces to
\begin{align*}
Q_\lambda(\gamma) \geq Q_\lambda(\beta).
\end{align*}
Thus $\beta$ minimizes $Q_\lambda$ over $\mathbb{R}^p$. This proves the converse characterization and completes the proof.
[/step]