[proofplan]
We prove the result by computing the subgradient of the weighted $\ell^1$ penalty coordinate by coordinate. The squared loss is differentiable, and its gradient is
\begin{align*}
-\frac{1}{n}X^\top(Y-X\beta).
\end{align*}
The weighted absolute-value penalty contributes exactly vectors of the form $Wz$, where $z_j$ is the sign on nonzero coordinates and any number in $[-1,1]$ on zero coordinates. The displayed equation is therefore precisely the first-order optimality condition for the convex objective.
[/proofplan]
custom_env
admin
[step:Define the objective and compute the gradient of the squared loss]
Define the weighted Lasso objective function $Q: \mathbb{R}^p \to \mathbb{R}$ by
\begin{align*}
Q(\beta) := \frac{1}{2n}|Y - X\beta|^2 + \lambda \sum_{j=1}^p w_j|\beta_j|.
\end{align*}
Also define the differentiable squared-loss part $L: \mathbb{R}^p \to \mathbb{R}$ by
\begin{align*}
L(\beta) := \frac{1}{2n}|Y - X\beta|^2.
\end{align*}
For every $\beta,h \in \mathbb{R}^p$, expanding the Euclidean square gives
\begin{align*}
L(\beta + h) = \frac{1}{2n}|Y - X\beta - Xh|^2 = \frac{1}{2n}|Y - X\beta|^2 - \frac{1}{n}(X^\top(Y-X\beta)) \cdot h + \frac{1}{2n}|Xh|^2.
\end{align*}
Thus the gradient of $L$ at $\beta$ is
\begin{align*}
\nabla L(\beta) = -\frac{1}{n}X^\top(Y-X\beta).
\end{align*}
[/step]
custom_env
admin
[step:Compute the subgradients of the weighted absolute-value penalty]Define the weighted penalty $P: \mathbb{R}^p \to \mathbb{R}$ by
\begin{align*}
P(\beta) := \sum_{j=1}^p w_j|\beta_j|.
\end{align*}
For $a \in \mathbb{R}$, the subgradient set of $t \mapsto |t|$ at $a$ is described as follows. If $a \neq 0$, then
\begin{align*}
\partial |a| = \{\operatorname{sgn}(a)\}.
\end{align*}
If $a = 0$, then
\begin{align*}
\partial |0| = [-1,1].
\end{align*}
Indeed, if $a \neq 0$, then $|t| \geq |a| + \operatorname{sgn}(a)(t-a)$ for every $t \in \mathbb{R}$. If $a=0$, the inequality $|t| \geq s t$ holds for every $t \in \mathbb{R}$ exactly when $s \in [-1,1]$.
Therefore, for $\beta \in \mathbb{R}^p$, a vector $g \in \mathbb{R}^p$ is a subgradient of $P$ at $\beta$ exactly when there exists $z \in \mathbb{R}^p$ such that
\begin{align*}
g = Wz,
\end{align*}
with
\begin{align*}
z_j = \operatorname{sgn}(\beta_j) \quad \text{if } \beta_j \neq 0,
\end{align*}
and
\begin{align*}
z_j \in [-1,1] \quad \text{if } \beta_j = 0.
\end{align*}[/step]
custom_env
admin
[guided]The penalty is separable, so its subgradient can be computed one coordinate at a time. First consider the scalar function $r: \mathbb{R} \to \mathbb{R}$ defined by $r(t) := |t|$. A number $s \in \mathbb{R}$ is a subgradient of $r$ at $a \in \mathbb{R}$ precisely when
\begin{align*}
|t| \geq |a| + s(t-a)
\end{align*}
for every $t \in \mathbb{R}$.
If $a>0$, then $|a|=a$ and the supporting line has slope $1$. The inequality
\begin{align*}
|t| \geq a + (t-a) = t
\end{align*}
holds for every $t \in \mathbb{R}$, and no other slope supports $|t|$ at $a$ because $|t|$ is differentiable at $a$ with derivative $1$. If $a<0$, the same argument gives the unique slope $-1$. Thus, when $a \neq 0$,
\begin{align*}
\partial |a| = \{\operatorname{sgn}(a)\}.
\end{align*}
At $a=0$, the supporting inequality becomes
\begin{align*}
|t| \geq st
\end{align*}
for every $t \in \mathbb{R}$. Taking $t>0$ gives $s \leq 1$, while taking $t<0$ gives $s \geq -1$. Hence the inequality holds exactly for $s \in [-1,1]$, and
\begin{align*}
\partial |0| = [-1,1].
\end{align*}
Now return to the vector penalty $P: \mathbb{R}^p \to \mathbb{R}$ defined by
\begin{align*}
P(\beta) := \sum_{j=1}^p w_j|\beta_j|.
\end{align*}
Because $P$ is a finite sum of one-dimensional convex functions applied to separate coordinates, a vector $g \in \mathbb{R}^p$ is a subgradient of $P$ at $\beta$ exactly when each coordinate has the form
\begin{align*}
g_j = w_j z_j,
\end{align*}
where
\begin{align*}
z_j = \operatorname{sgn}(\beta_j) \quad \text{if } \beta_j \neq 0,
\end{align*}
and
\begin{align*}
z_j \in [-1,1] \quad \text{if } \beta_j = 0.
\end{align*}
In vector notation this is exactly $g = Wz$.[/guided]
custom_env
admin
[step:Derive the KKT condition from optimality]
Assume that $\hat{\beta}^w$ minimizes $Q$ over $\mathbb{R}^p$. The function $L$ is finite, convex, and differentiable on $\mathbb{R}^p$, because it is a nonnegative quadratic function. The function $P$ is finite and convex on $\mathbb{R}^p$, because each coordinate function $\beta_j \mapsto w_j|\beta_j|$ is convex and the sum is finite. By the [subgradient first-order optimality condition for convex functions](/page/Subgradient) and the finite-sum rule for [subgradients](/page/Subgradient), $\hat{\beta}^w$ minimizes $Q=L+\lambda P$ exactly when there exists $g \in \partial P(\hat{\beta}^w)$ such that
\begin{align*}
0 = \nabla L(\hat{\beta}^w) + \lambda g.
\end{align*}
Using the gradient computed above,
\begin{align*}
0 = -\frac{1}{n}X^\top(Y-X\hat{\beta}^w) + \lambda g.
\end{align*}
By the subgradient description of $P$, there exists $z \in \mathbb{R}^p$ such that $g=Wz$ and
\begin{align*}
z_j = \operatorname{sgn}(\hat{\beta}^w_j)
\quad \text{if } \hat{\beta}^w_j \neq 0,
\end{align*}
while
\begin{align*}
z_j \in [-1,1]
\quad \text{if } \hat{\beta}^w_j = 0.
\end{align*}
Substituting $g=Wz$ into the first-order condition gives
\begin{align*}
\frac{1}{n}X^\top(Y-X\hat{\beta}^w) = \lambda Wz.
\end{align*}
[/step]
custom_env
admin
[step:Use the KKT condition to prove optimality]
Conversely, assume that there exists $z \in \mathbb{R}^p$ satisfying the displayed equation and the coordinate conditions. Define $g := Wz \in \mathbb{R}^p$. By the coordinate conditions and the subgradient computation above, $g \in \partial P(\hat{\beta}^w)$.
Let $\beta \in \mathbb{R}^p$ be arbitrary. Since $g \in \partial P(\hat{\beta}^w)$, the defining [subgradient](/page/Subgradient) inequality for $P$ at $\hat{\beta}^w$ gives
\begin{align*}
P(\beta) \geq P(\hat{\beta}^w) + g \cdot (\beta-\hat{\beta}^w).
\end{align*}
The convexity and differentiability of $L$ give
\begin{align*}
L(\beta) \geq L(\hat{\beta}^w) + \nabla L(\hat{\beta}^w)\cdot(\beta-\hat{\beta}^w).
\end{align*}
Adding these inequalities after multiplying the first by $\lambda$ yields
\begin{align*}
Q(\beta)
&\geq Q(\hat{\beta}^w)
+ \left(\nabla L(\hat{\beta}^w)+\lambda g\right)\cdot(\beta-\hat{\beta}^w).
\end{align*}
The assumed stationarity equation is equivalent to
\begin{align*}
\nabla L(\hat{\beta}^w)+\lambda g = 0.
\end{align*}
Therefore
\begin{align*}
Q(\beta) \geq Q(\hat{\beta}^w).
\end{align*}
Since $\beta \in \mathbb{R}^p$ was arbitrary, $\hat{\beta}^w$ minimizes $Q$ over $\mathbb{R}^p$. Hence $\hat{\beta}^w$ is a weighted Lasso estimator.
[/step]