Lasso Cone Condition — Statement & Proof

Theorem

Edit Issues Pull Requests Attributions Admin

Discussion

Proof

[proofplan] The proof starts from the optimality of $\hat{\beta}$ in the Lasso objective and compares the objective value at $\hat{\beta}$ with the value at the true vector $\beta^*$. After substituting $Y = X\beta^* + \varepsilon$, the resulting basic inequality contains a quadratic prediction term, a stochastic score term, and the difference of two $\ell^1$ penalties. On $\mathcal{E}_\lambda$, the score term is bounded by one half of $\lambda\|\Delta\|_1$, while the support condition $\beta^*_{S^c}=0$ makes the penalty difference separate into an $S$ part and an $S^c$ part. The non-negativity of the quadratic prediction term then forces the negative $S^c$ contribution not to dominate the positive $S$ contribution. [/proofplan] [step:Derive the basic inequality from Lasso optimality] Since $\hat{\beta}$ minimizes the Lasso objective over $\mathbb{R}^p$, evaluating the same objective at $\beta^*$ gives \begin{align*} \frac{1}{2n}\|Y - X\hat{\beta}\|_2^2 + \lambda \|\hat{\beta}\|_1 \leq \frac{1}{2n}\|Y - X\beta^*\|_2^2 + \lambda \|\beta^*\|_1. \end{align*} Using $Y = X\beta^* + \varepsilon$ and $\Delta = \hat{\beta} - \beta^*$, we have \begin{align*} Y - X\hat{\beta} = X\beta^* + \varepsilon - X\hat{\beta} = \varepsilon - X\Delta, \qquad Y - X\beta^* = \varepsilon. \end{align*} Substituting these identities into the optimality inequality yields \begin{align*} \frac{1}{2n}\|\varepsilon - X\Delta\|_2^2 + \lambda \|\hat{\beta}\|_1 \leq \frac{1}{2n}\|\varepsilon\|_2^2 + \lambda \|\beta^*\|_1. \end{align*} Expanding the square in the Euclidean norm gives \begin{align*} \|\varepsilon - X\Delta\|_2^2 = \|\varepsilon\|_2^2 - 2\varepsilon^\top X\Delta + \|X\Delta\|_2^2. \end{align*} After cancellation of $\|\varepsilon\|_2^2/(2n)$ and rearrangement, we obtain the basic inequality \begin{align*} \frac{1}{2n}\|X\Delta\|_2^2 \leq \frac{1}{n}\varepsilon^\top X\Delta + \lambda\bigl(\|\beta^*\|_1 - \|\hat{\beta}\|_1\bigr). \end{align*} [guided] The only input in this step is the defining property of $\hat{\beta}$ as a minimizer. The Lasso objective is the map $Q_\lambda: \mathbb{R}^p \to \mathbb{R}$ defined by \begin{align*} Q_\lambda(\beta) := \frac{1}{2n}\|Y - X\beta\|_2^2 + \lambda\|\beta\|_1. \end{align*} Since $\hat{\beta}$ minimizes $Q_\lambda$ over $\mathbb{R}^p$, the comparison vector $\beta^*$ is admissible, and hence \begin{align*} Q_\lambda(\hat{\beta}) \leq Q_\lambda(\beta^*). \end{align*} Writing this out gives \begin{align*} \frac{1}{2n}\|Y - X\hat{\beta}\|_2^2 + \lambda \|\hat{\beta}\|_1 \leq \frac{1}{2n}\|Y - X\beta^*\|_2^2 + \lambda \|\beta^*\|_1. \end{align*} Now we use the statistical model. The response is $Y = X\beta^* + \varepsilon$, and the error vector is $\Delta = \hat{\beta} - \beta^*$. Therefore \begin{align*} Y - X\hat{\beta} = X\beta^* + \varepsilon - X\hat{\beta}. \end{align*} Since $\Delta=\hat{\beta}-\beta^*$, this becomes \begin{align*} Y - X\hat{\beta} = \varepsilon - X\Delta. \end{align*} while \begin{align*} Y - X\beta^* = \varepsilon. \end{align*} Substituting these two identities into the objective comparison gives \begin{align*} \frac{1}{2n}\|\varepsilon - X\Delta\|_2^2 + \lambda \|\hat{\beta}\|_1 \leq \frac{1}{2n}\|\varepsilon\|_2^2 + \lambda \|\beta^*\|_1. \end{align*} The purpose of expanding the square is to isolate the prediction error $\|X\Delta\|_2^2$. Using the Euclidean identity \begin{align*} |a-b|^2 = |a|^2 - 2a^\top b + |b|^2 \end{align*} with $a=\varepsilon$ and $b=X\Delta$, we get \begin{align*} \|\varepsilon - X\Delta\|_2^2 = \|\varepsilon\|_2^2 - 2\varepsilon^\top X\Delta + \|X\Delta\|_2^2. \end{align*} Thus \begin{align*} \frac{1}{2n} \left( \|\varepsilon\|_2^2 - 2\varepsilon^\top X\Delta + \|X\Delta\|_2^2 \right) + \lambda\|\hat{\beta}\|_1 \leq \frac{1}{2n}\|\varepsilon\|_2^2 + \lambda\|\beta^*\|_1. \end{align*} Canceling $\|\varepsilon\|_2^2/(2n)$ from both sides and moving the remaining terms gives \begin{align*} \frac{1}{2n}\|X\Delta\|_2^2 \leq \frac{1}{n}\varepsilon^\top X\Delta + \lambda\bigl(\|\beta^*\|_1 - \|\hat{\beta}\|_1\bigr). \end{align*} This is the basic inequality: it converts optimality of the penalized estimator into a deterministic inequality involving the prediction error, the score term, and the penalty difference. [/guided] [/step] [step:Control the score term on $\mathcal{E}_\lambda$] Define the score vector $z \in \mathbb{R}^p$ by \begin{align*} z := \frac{1}{n}X^\top \varepsilon. \end{align*} Then \begin{align*} \frac{1}{n}\varepsilon^\top X\Delta = z^\top \Delta. \end{align*} On $\mathcal{E}_\lambda$, we have \begin{align*} \|z\|_\infty \leq \frac{\lambda}{2}. \end{align*} The duality inequality between the $\ell^\infty$ and $\ell^1$ norms gives \begin{align*} z^\top \Delta \leq |z^\top \Delta| \leq \|z\|_\infty \|\Delta\|_1 \leq \frac{\lambda}{2}\|\Delta\|_1. \end{align*} Since $\Delta_S$ and $\Delta_{S^c}$ have disjoint supports, \begin{align*} \|\Delta\|_1 = \|\Delta_S\|_1 + \|\Delta_{S^c}\|_1. \end{align*} Therefore, on $\mathcal{E}_\lambda$, \begin{align*} \frac{1}{n}\varepsilon^\top X\Delta \leq \frac{\lambda}{2}\|\Delta_S\|_1 + \frac{\lambda}{2}\|\Delta_{S^c}\|_1. \end{align*} [guided] The score term is the only random linear term left in the basic inequality, so we rewrite it as an [inner product](/page/Inner%20Product) with a vector whose sup norm is controlled on $\mathcal{E}_\lambda$. Define \begin{align*} z := \frac{1}{n}X^\top \varepsilon \in \mathbb{R}^p. \end{align*} By associativity of matrix multiplication and the definition of the Euclidean inner product on $\mathbb{R}^p$, \begin{align*} \frac{1}{n}\varepsilon^\top X\Delta = \left(\frac{1}{n}X^\top\varepsilon\right)^\top\Delta = z^\top\Delta. \end{align*} The event $\mathcal{E}_\lambda$ is precisely the event on which this score vector is small in $\ell^\infty$ norm, namely \begin{align*} \|z\|_\infty \leq \frac{\lambda}{2}. \end{align*} We now pair this sup-norm control with the $\ell^1$ size of the error. The duality inequality between $\ell^\infty$ and $\ell^1$ on $\mathbb{R}^p$ states that for any $a,b \in \mathbb{R}^p$, \begin{align*} |a^\top b| \leq \|a\|_\infty\|b\|_1. \end{align*} Applying this inequality with $a=z$ and $b=\Delta$ gives \begin{align*} z^\top\Delta \leq |z^\top\Delta| \leq \|z\|_\infty\|\Delta\|_1 \leq \frac{\lambda}{2}\|\Delta\|_1. \end{align*} Finally, we split the $\ell^1$ norm of $\Delta$ along the two disjoint coordinate sets $S$ and $S^c$. Since these supports are disjoint, the coordinate sums defining the $\ell^1$ norm separate: \begin{align*} \|\Delta\|_1 = \|\Delta_S\|_1+\|\Delta_{S^c}\|_1. \end{align*} Substituting this decomposition into the score bound yields \begin{align*} \frac{1}{n}\varepsilon^\top X\Delta \leq \frac{\lambda}{2}\|\Delta_S\|_1 + \frac{\lambda}{2}\|\Delta_{S^c}\|_1. \end{align*} [/guided] [/step] [step:Decompose the penalty difference using the support of $\beta^*$] Because $S=\operatorname{supp}(\beta^*)$, we have $\beta^*_{S^c}=0$. Since $\hat{\beta}=\beta^*+\Delta$, the restrictions of $\hat{\beta}$ to $S$ and $S^c$ are \begin{align*} \hat{\beta}_S = \beta^*_S + \Delta_S, \qquad \hat{\beta}_{S^c} = \Delta_{S^c}. \end{align*} Using additivity of the $\ell^1$ norm over disjoint coordinate sets, \begin{align*} \|\beta^*\|_1 - \|\hat{\beta}\|_1 &= \|\beta^*_S\|_1 - \|\beta^*_S+\Delta_S\|_1 - \|\Delta_{S^c}\|_1. \end{align*} The [reverse triangle inequality](/theorems/2300) gives \begin{align*} \|\beta^*_S\|_1 - \|\beta^*_S+\Delta_S\|_1 \leq \|\Delta_S\|_1. \end{align*} Consequently, \begin{align*} \|\beta^*\|_1 - \|\hat{\beta}\|_1 \leq \|\Delta_S\|_1 - \|\Delta_{S^c}\|_1. \end{align*} [guided] The support condition is what creates the cone. Since $S=\operatorname{supp}(\beta^*)$, all coordinates of $\beta^*$ outside $S$ vanish, which means \begin{align*} \beta^*_{S^c}=0. \end{align*} Because $\Delta=\hat{\beta}-\beta^*$, we also have $\hat{\beta}=\beta^*+\Delta$. Restricting this identity to the two disjoint coordinate sets $S$ and $S^c$ gives \begin{align*} \hat{\beta}_S = \beta^*_S+\Delta_S, \qquad \hat{\beta}_{S^c} = \beta^*_{S^c}+\Delta_{S^c} = \Delta_{S^c}. \end{align*} The $\ell^1$ norm is additive over disjoint coordinate sets: if a vector is split into its coordinates on $S$ and on $S^c$, then its $\ell^1$ norm is the sum of the two corresponding $\ell^1$ norms. Therefore \begin{align*} \|\beta^*\|_1 = \|\beta^*_S\|_1 + \|\beta^*_{S^c}\|_1 = \|\beta^*_S\|_1, \end{align*} and \begin{align*} \|\hat{\beta}\|_1 = \|\hat{\beta}_S\|_1+\|\hat{\beta}_{S^c}\|_1 = \|\beta^*_S+\Delta_S\|_1+\|\Delta_{S^c}\|_1. \end{align*} Subtracting these identities gives \begin{align*} \|\beta^*\|_1 - \|\hat{\beta}\|_1 = \|\beta^*_S\|_1 - \|\beta^*_S+\Delta_S\|_1 - \|\Delta_{S^c}\|_1. \end{align*} Now we bound the first two terms. The reverse triangle inequality for the $\ell^1$ norm says that for vectors $u,v \in \mathbb{R}^{|S|}$, \begin{align*} \bigl|\|u\|_1-\|v\|_1\bigr| \leq \|u-v\|_1. \end{align*} Apply this with $u=\beta^*_S$ and $v=\beta^*_S+\Delta_S$. Then \begin{align*} \|\beta^*_S\|_1-\|\beta^*_S+\Delta_S\|_1 \leq \bigl|\|\beta^*_S\|_1-\|\beta^*_S+\Delta_S\|_1\bigr| \leq \|\Delta_S\|_1. \end{align*} Substituting this estimate into the penalty decomposition yields \begin{align*} \|\beta^*\|_1 - \|\hat{\beta}\|_1 \leq \|\Delta_S\|_1-\|\Delta_{S^c}\|_1. \end{align*} The important feature is the sign: the coordinates outside the true support enter with a negative coefficient. [/guided] [/step] [step:Combine the bounds and force the cone inequality] Substituting the score estimate and the penalty estimate into the basic inequality gives, on $\mathcal{E}_\lambda$, \begin{align*} \frac{1}{2n}\|X\Delta\|_2^2 \leq \frac{\lambda}{2}\|\Delta_S\|_1 + \frac{\lambda}{2}\|\Delta_{S^c}\|_1 + \lambda\|\Delta_S\|_1 - \lambda\|\Delta_{S^c}\|_1. \end{align*} Combining like terms on the right-hand side gives \begin{align*} \frac{1}{2n}\|X\Delta\|_2^2 \leq \frac{3\lambda}{2}\|\Delta_S\|_1 - \frac{\lambda}{2}\|\Delta_{S^c}\|_1. \end{align*} The left-hand side is non-negative because it is a scalar multiple of a squared Euclidean norm. Hence \begin{align*} 0 \leq \frac{3\lambda}{2}\|\Delta_S\|_1 - \frac{\lambda}{2}\|\Delta_{S^c}\|_1. \end{align*} Since $\lambda>0$, multiplying by the positive scalar \begin{align*} \frac{2}{\lambda} \end{align*} gives \begin{align*} \|\Delta_{S^c}\|_1 \leq 3\|\Delta_S\|_1. \end{align*} This is exactly the desired Lasso cone condition. [guided] We now insert the two estimates proved in the previous steps into the basic inequality. The basic inequality says \begin{align*} \frac{1}{2n}\|X\Delta\|_2^2 \leq \frac{1}{n}\varepsilon^\top X\Delta + \lambda\bigl(\|\beta^*\|_1-\|\hat{\beta}\|_1\bigr). \end{align*} On $\mathcal{E}_\lambda$, the score estimate gives \begin{align*} \frac{1}{n}\varepsilon^\top X\Delta \leq \frac{\lambda}{2}\|\Delta_S\|_1 + \frac{\lambda}{2}\|\Delta_{S^c}\|_1, \end{align*} and the penalty estimate gives \begin{align*} \|\beta^*\|_1-\|\hat{\beta}\|_1 \leq \|\Delta_S\|_1-\|\Delta_{S^c}\|_1. \end{align*} Substituting both upper bounds into the right-hand side gives \begin{align*} \frac{1}{2n}\|X\Delta\|_2^2 \leq \frac{\lambda}{2}\|\Delta_S\|_1 + \frac{\lambda}{2}\|\Delta_{S^c}\|_1 + \lambda\|\Delta_S\|_1 - \lambda\|\Delta_{S^c}\|_1. \end{align*} Collecting the coefficients of $\|\Delta_S\|_1$ and $\|\Delta_{S^c}\|_1$ yields \begin{align*} \frac{1}{2n}\|X\Delta\|_2^2 \leq \frac{3\lambda}{2}\|\Delta_S\|_1 - \frac{\lambda}{2}\|\Delta_{S^c}\|_1. \end{align*} The left-hand side is non-negative because it is a non-negative scalar multiple of the squared Euclidean norm of $X\Delta$. Therefore the right-hand side must also be non-negative: \begin{align*} 0 \leq \frac{3\lambda}{2}\|\Delta_S\|_1 - \frac{\lambda}{2}\|\Delta_{S^c}\|_1. \end{align*} Rearranging this inequality gives \begin{align*} \frac{\lambda}{2}\|\Delta_{S^c}\|_1 \leq \frac{3\lambda}{2}\|\Delta_S\|_1. \end{align*} Since the theorem assumes $\lambda>0$, division by the positive scalar \begin{align*} \frac{\lambda}{2} \end{align*} preserves the inequality and gives \begin{align*} \|\Delta_{S^c}\|_1 \leq 3\|\Delta_S\|_1. \end{align*} This is the Lasso cone condition asserted in the theorem. [/guided] [/step]

Prerequisites (0/4 completed)

Prerequisites Graph

Interactive dependency map showing how this theorem builds on foundational concepts

Loading dependency graph...

Theorems

Definitions & Concepts

What brings you to Androma?

Start with a route through the knowledge graph.

Lasso Cone Condition (Theorem # 5557)

Discussion

Proof

Prerequisites (0/4 completed)

Prerequisites Graph

Explore Further

Sign in to Androma

Check your inbox

One last step

Lasso Cone Condition (Theorem # 5557)

Discussion

Proof

Prerequisites (0/4 completed)

Prerequisites Graph

Explore Further