Soft-Thresholding Formula for the Lasso with Orthonormal Design

Soft-Thresholding Formula for the Lasso with Orthonormal Design (Theorem # 5556)

Theorem

Edit Issues Pull Requests Attributions Admin

Discussion

Proof

[proofplan] We first rewrite the Lasso objective using the identity $\frac{1}{n}X^\top X=I_p$. This converts the quadratic loss into a constant plus a squared Euclidean distance from $\beta$ to $z$, so the minimisation separates into $p$ independent one-dimensional convex problems. We then solve a generic scalar problem exactly by considering whether the scalar datum lies above $\lambda$, inside $[-\lambda,\lambda]$, or below $-\lambda$. The strict convexity of the separated scalar objectives gives uniqueness of each coordinate and therefore uniqueness of the full minimiser. [/proofplan] [step:Show that the orthonormal design assumption forces $p \leq n$] The hypothesis $X^\top X/n=I_p$ is equivalent to $X^\top X = n I_p$. Hence the matrix $X^\top X$ is invertible, so the $p$ columns of $X$ are linearly independent vectors in $\mathbb{R}^n$. A linearly independent family in $\mathbb{R}^n$ has at most $n$ elements, so $p \leq n$. [/step] [step:Rewrite the Lasso objective as a separable sum] Define the objective function $Q: \mathbb{R}^p \to \mathbb{R}$ by \begin{align*} Q(\beta) := \frac{1}{2n}|Y-X\beta|^2+\lambda\sum_{j=1}^p |\beta_j|. \end{align*} For $\beta \in \mathbb{R}^p$, expand the squared norm using the Euclidean [inner product](/page/Inner%20Product). The expansion is \begin{align*} \frac{1}{2n}|Y-X\beta|^2 = \frac{1}{2n}|Y|^2 - \frac{1}{n}\beta^\top X^\top Y + \frac{1}{2n}\beta^\top X^\top X\beta. \end{align*} Using $z = X^\top Y/n$ and $X^\top X/n = I_p$, this becomes \begin{align*} \frac{1}{2n}|Y-X\beta|^2 = \frac{1}{2n}|Y|^2-\beta^\top z+\frac{1}{2}|\beta|^2. \end{align*} Completing the square gives \begin{align*} \frac{1}{2n}|Y-X\beta|^2 = \frac{1}{2n}|Y|^2 -\frac{1}{2}|z|^2 +\frac{1}{2}|\beta-z|^2. \end{align*} Define the constant $C \in \mathbb{R}$ by \begin{align*} C := \frac{1}{2n}|Y|^2-\frac{1}{2}|z|^2. \end{align*} Then \begin{align*} Q(\beta) = C+\sum_{j=1}^p \left(\frac{1}{2}(\beta_j-z_j)^2+\lambda|\beta_j|\right). \end{align*} Thus minimising $Q$ over $\mathbb{R}^p$ is equivalent to minimising each coordinate function independently. [guided] The reason the orthonormal design hypothesis is powerful is that it removes all interactions between different coordinates of $\beta$. Define the function $Q: \mathbb{R}^p \to \mathbb{R}$ by \begin{align*} Q(\beta) := \frac{1}{2n}|Y-X\beta|^2+\lambda\sum_{j=1}^p |\beta_j|. \end{align*} We expand the quadratic term. For every $\beta \in \mathbb{R}^p$, \begin{align*} \frac{1}{2n}|Y-X\beta|^2 = \frac{1}{2n}(Y-X\beta)^\top(Y-X\beta). \end{align*} Expanding the Euclidean inner product gives \begin{align*} \frac{1}{2n}|Y-X\beta|^2 = \frac{1}{2n}|Y|^2 -\frac{1}{n}\beta^\top X^\top Y +\frac{1}{2n}\beta^\top X^\top X\beta. \end{align*} Now use the definitions The definition of $z$ gives \begin{align*} z = \frac{1}{n}X^\top Y. \end{align*} The orthonormal design hypothesis gives \begin{align*} \frac{1}{n}X^\top X = I_p. \end{align*} These give \begin{align*} \frac{1}{2n}|Y-X\beta|^2 = \frac{1}{2n}|Y|^2-\beta^\top z+\frac{1}{2}|\beta|^2. \end{align*} Completing the square in $\beta$ gives \begin{align*} \frac{1}{2}|\beta|^2-\beta^\top z = \frac{1}{2}|\beta-z|^2-\frac{1}{2}|z|^2. \end{align*} Therefore, if \begin{align*} C := \frac{1}{2n}|Y|^2-\frac{1}{2}|z|^2, \end{align*} then \begin{align*} Q(\beta) = C+\frac{1}{2}|\beta-z|^2+\lambda\sum_{j=1}^p |\beta_j|. \end{align*} Since $|\beta-z|^2 = \sum_{j=1}^p (\beta_j-z_j)^2$, this is \begin{align*} Q(\beta) = C+\sum_{j=1}^p \left( \frac{1}{2}(\beta_j-z_j)^2+\lambda|\beta_j| \right). \end{align*} The constant $C$ does not affect the minimiser. What remains is a sum of terms where the $j$th term depends only on $\beta_j$. Hence the full $p$-dimensional minimisation problem reduces exactly to $p$ independent one-dimensional minimisation problems. [/guided] [/step] [step:Solve the one-dimensional soft-thresholding problem] Fix $t \in \mathbb{R}$ and define $\phi_t: \mathbb{R} \to \mathbb{R}$ by \begin{align*} \phi_t(u) := \frac{1}{2}(u-t)^2+\lambda|u|. \end{align*} We claim that $\phi_t$ has the unique minimiser $S_\lambda(t)$. If $u>0$, then \begin{align*} \phi_t(u)=\frac{1}{2}(u-t)^2+\lambda u, \qquad \phi_t'(u)=u-t+\lambda. \end{align*} The critical point in $(0,\infty)$ is $u=t-\lambda$, which lies in $(0,\infty)$ exactly when $t>\lambda$. If $u<0$, then \begin{align*} \phi_t(u)=\frac{1}{2}(u-t)^2-\lambda u, \qquad \phi_t'(u)=u-t-\lambda. \end{align*} The critical point in $(-\infty,0)$ is $u=t+\lambda$, which lies in $(-\infty,0)$ exactly when $t<-\lambda$. It remains to handle $u=0$. For $u>0$, \begin{align*} \phi_t'(u)=u-t+\lambda, \end{align*} so the right derivative at $0$ is $\lambda-t$. For $u<0$, \begin{align*} \phi_t'(u)=u-t-\lambda, \end{align*} so the left derivative at $0$ is $-\lambda-t$. We verify the minimisation condition at $0$ directly. If $u>0$, then \begin{align*} \phi_t(u)-\phi_t(0) = \frac{1}{2}u^2+u(\lambda-t). \end{align*} If $u<0$, then \begin{align*} \phi_t(u)-\phi_t(0) = \frac{1}{2}u^2-u(\lambda+t). \end{align*} Therefore $\phi_t(u) \geq \phi_t(0)$ for every $u \in \mathbb{R}$ exactly when $\lambda-t \geq 0$ and $\lambda+t \geq 0$, equivalently when \begin{align*} -\lambda \leq t \leq \lambda. \end{align*} Thus $0$ minimises $\phi_t$ exactly when $|t|\leq \lambda$. Combining the three cases, the only candidate selected by the derivative computations is $t-\lambda$ when $t>\lambda$, is $0$ when $|t|\leq \lambda$, and is $t+\lambda$ when $t<-\lambda$. This is exactly the soft-thresholding value \begin{align*} S_\lambda(t)=\operatorname{sgn}(t)(|t|-\lambda)_+. \end{align*} It remains only to justify that these candidates are global and unique minimisers; we do this directly rather than citing a convex-analysis theorem. Since $u \mapsto \frac{1}{2}(u-t)^2$ is strictly convex and coercive on $\mathbb{R}$, and since $u \mapsto \lambda|u|$ is convex, the function $\phi_t$ is strictly convex and coercive. The one-sided derivative computations show that, in the case $t>\lambda$, the derivative is negative on $(-\infty,t-\lambda)$ and positive on $(t-\lambda,\infty)$, so $t-\lambda$ is the global minimiser. In the case $t<-\lambda$, the derivative is negative on $(-\infty,t+\lambda)$ and positive on $(t+\lambda,\infty)$, so $t+\lambda$ is the global minimiser. In the case $|t|\leq \lambda$, the direct difference estimates above show $\phi_t(u)\geq \phi_t(0)$ for every $u\in\mathbb{R}$. Strict convexity then makes the global minimiser unique in all three cases. Hence \begin{align*} \operatorname*{arg\,min}_{u \in \mathbb{R}} \phi_t(u)=S_\lambda(t). \end{align*} [/step] [step:Assemble the coordinatewise minimisers and prove uniqueness] For each $j \in \{1,\dots,p\}$, define $\phi_{z_j}: \mathbb{R} \to \mathbb{R}$ by \begin{align*} \phi_{z_j}(u) := \frac{1}{2}(u-z_j)^2+\lambda|u|. \end{align*} From the previous step, the unique minimiser of $\phi_{z_j}$ is $S_\lambda(z_j)$. Therefore the unique minimiser of the separated sum $\sum_{j=1}^p \phi_{z_j}(\beta_j)$ is the vector $\hat{\beta} \in \mathbb{R}^p$ defined coordinatewise by \begin{align*} \hat{\beta}_j := S_\lambda(z_j), \qquad j=1,\dots,p. \end{align*} Adding the constant $C$ does not change the minimiser, so the same vector uniquely minimises $Q$. This is exactly the penalised Lasso solution, and the stated formula follows. [/step]

Prerequisites (0/1 completed)

Prerequisites Graph

Interactive dependency map showing how this theorem builds on foundational concepts

Loading dependency graph...

Definitions & Concepts

Inner Product

Explore Further

Inner Product Definition Student's t Distribution of the Regression Coefficient Test Statistic Probability & Statistics Martingale Prediction Principle Probability Theory Independence of Fitted Values and Residuals in the Normal Linear Model Probability & Statistics Doob-Dynkin Representation Probability & Statistics Kolmogorov's Continuity Criterion Stochastic Processes Addition Property of the Poisson Distribution Poisson Processes Basic Properties of the Ordinary Least Squares Hat Matrix Probability & Statistics Oracle Inequality for the Lasso under a Uniform Compatibility Condition Probability & Statistics Probability & Statistics Area

What brings you to Androma?

Start with a route through the knowledge graph.