[proofplan]
The proof starts from the optimality of $\hat{\beta}$ in the Lasso objective and compares the objective value at $\hat{\beta}$ with the value at the true vector $\beta^*$. After substituting $Y = X\beta^* + \varepsilon$, the resulting basic inequality contains a quadratic prediction term, a stochastic score term, and the difference of two $\ell^1$ penalties. On $\mathcal{E}_\lambda$, the score term is bounded by one half of $\lambda\|\Delta\|_1$, while the support condition $\beta^*_{S^c}=0$ makes the penalty difference separate into an $S$ part and an $S^c$ part. The non-negativity of the quadratic prediction term then forces the negative $S^c$ contribution not to dominate the positive $S$ contribution.
[/proofplan]
[step:Derive the basic inequality from Lasso optimality]
Since $\hat{\beta}$ minimizes the Lasso objective over $\mathbb{R}^p$, evaluating the same objective at $\beta^*$ gives
\begin{align*}
\frac{1}{2n}\|Y - X\hat{\beta}\|_2^2 + \lambda \|\hat{\beta}\|_1
\leq
\frac{1}{2n}\|Y - X\beta^*\|_2^2 + \lambda \|\beta^*\|_1.
\end{align*}
Using $Y = X\beta^* + \varepsilon$ and $\Delta = \hat{\beta} - \beta^*$, we have
\begin{align*}
Y - X\hat{\beta}
=
X\beta^* + \varepsilon - X\hat{\beta}
=
\varepsilon - X\Delta,
\qquad
Y - X\beta^*
=
\varepsilon.
\end{align*}
Substituting these identities into the optimality inequality yields
\begin{align*}
\frac{1}{2n}\|\varepsilon - X\Delta\|_2^2 + \lambda \|\hat{\beta}\|_1
\leq
\frac{1}{2n}\|\varepsilon\|_2^2 + \lambda \|\beta^*\|_1.
\end{align*}
Expanding the square in the Euclidean norm gives
\begin{align*}
\|\varepsilon - X\Delta\|_2^2
=
\|\varepsilon\|_2^2 - 2\varepsilon^\top X\Delta + \|X\Delta\|_2^2.
\end{align*}
After cancellation of $\|\varepsilon\|_2^2/(2n)$ and rearrangement, we obtain the basic inequality
\begin{align*}
\frac{1}{2n}\|X\Delta\|_2^2
\leq
\frac{1}{n}\varepsilon^\top X\Delta
+
\lambda\bigl(\|\beta^*\|_1 - \|\hat{\beta}\|_1\bigr).
\end{align*}
[guided]
The only input in this step is the defining property of $\hat{\beta}$ as a minimizer. The Lasso objective is the map $Q_\lambda: \mathbb{R}^p \to \mathbb{R}$ defined by
\begin{align*}
Q_\lambda(\beta) := \frac{1}{2n}\|Y - X\beta\|_2^2 + \lambda\|\beta\|_1.
\end{align*}
Since $\hat{\beta}$ minimizes $Q_\lambda$ over $\mathbb{R}^p$, the comparison vector $\beta^*$ is admissible, and hence
\begin{align*}
Q_\lambda(\hat{\beta}) \leq Q_\lambda(\beta^*).
\end{align*}
Writing this out gives
\begin{align*}
\frac{1}{2n}\|Y - X\hat{\beta}\|_2^2 + \lambda \|\hat{\beta}\|_1
\leq
\frac{1}{2n}\|Y - X\beta^*\|_2^2 + \lambda \|\beta^*\|_1.
\end{align*}
Now we use the statistical model. The response is $Y = X\beta^* + \varepsilon$, and the error vector is $\Delta = \hat{\beta} - \beta^*$. Therefore
\begin{align*}
Y - X\hat{\beta} = X\beta^* + \varepsilon - X\hat{\beta}.
\end{align*}
Since $\Delta=\hat{\beta}-\beta^*$, this becomes
\begin{align*}
Y - X\hat{\beta} = \varepsilon - X\Delta.
\end{align*}
while
\begin{align*}
Y - X\beta^* = \varepsilon.
\end{align*}
Substituting these two identities into the objective comparison gives
\begin{align*}
\frac{1}{2n}\|\varepsilon - X\Delta\|_2^2 + \lambda \|\hat{\beta}\|_1
\leq
\frac{1}{2n}\|\varepsilon\|_2^2 + \lambda \|\beta^*\|_1.
\end{align*}
The purpose of expanding the square is to isolate the prediction error $\|X\Delta\|_2^2$. Using the Euclidean identity
\begin{align*}
|a-b|^2 = |a|^2 - 2a^\top b + |b|^2
\end{align*}
with $a=\varepsilon$ and $b=X\Delta$, we get
\begin{align*}
\|\varepsilon - X\Delta\|_2^2
=
\|\varepsilon\|_2^2 - 2\varepsilon^\top X\Delta + \|X\Delta\|_2^2.
\end{align*}
Thus
\begin{align*}
\frac{1}{2n}
\left(
\|\varepsilon\|_2^2 - 2\varepsilon^\top X\Delta + \|X\Delta\|_2^2
\right)
+
\lambda\|\hat{\beta}\|_1
\leq
\frac{1}{2n}\|\varepsilon\|_2^2
+
\lambda\|\beta^*\|_1.
\end{align*}
Canceling $\|\varepsilon\|_2^2/(2n)$ from both sides and moving the remaining terms gives
\begin{align*}
\frac{1}{2n}\|X\Delta\|_2^2
\leq
\frac{1}{n}\varepsilon^\top X\Delta
+
\lambda\bigl(\|\beta^*\|_1 - \|\hat{\beta}\|_1\bigr).
\end{align*}
This is the basic inequality: it converts optimality of the penalized estimator into a deterministic inequality involving the prediction error, the score term, and the penalty difference.
[/guided]
[/step]
[step:Control the score term on $\mathcal{E}_\lambda$]
Define the score vector $z \in \mathbb{R}^p$ by
\begin{align*}
z := \frac{1}{n}X^\top \varepsilon.
\end{align*}
Then
\begin{align*}
\frac{1}{n}\varepsilon^\top X\Delta
=
z^\top \Delta.
\end{align*}
On $\mathcal{E}_\lambda$, we have
\begin{align*}
\|z\|_\infty \leq \frac{\lambda}{2}.
\end{align*}
The duality inequality between the $\ell^\infty$ and $\ell^1$ norms gives
\begin{align*}
z^\top \Delta
\leq
|z^\top \Delta|
\leq
\|z\|_\infty \|\Delta\|_1
\leq
\frac{\lambda}{2}\|\Delta\|_1.
\end{align*}
Since $\Delta_S$ and $\Delta_{S^c}$ have disjoint supports,
\begin{align*}
\|\Delta\|_1 = \|\Delta_S\|_1 + \|\Delta_{S^c}\|_1.
\end{align*}
Therefore, on $\mathcal{E}_\lambda$,
\begin{align*}
\frac{1}{n}\varepsilon^\top X\Delta
\leq
\frac{\lambda}{2}\|\Delta_S\|_1
+
\frac{\lambda}{2}\|\Delta_{S^c}\|_1.
\end{align*}
[guided]
The score term is the only random linear term left in the basic inequality, so we rewrite it as an [inner product](/page/Inner%20Product) with a vector whose sup norm is controlled on $\mathcal{E}_\lambda$. Define
\begin{align*}
z := \frac{1}{n}X^\top \varepsilon \in \mathbb{R}^p.
\end{align*}
By associativity of matrix multiplication and the definition of the Euclidean inner product on $\mathbb{R}^p$,
\begin{align*}
\frac{1}{n}\varepsilon^\top X\Delta
=
\left(\frac{1}{n}X^\top\varepsilon\right)^\top\Delta
=
z^\top\Delta.
\end{align*}
The event $\mathcal{E}_\lambda$ is precisely the event on which this score vector is small in $\ell^\infty$ norm, namely
\begin{align*}
\|z\|_\infty \leq \frac{\lambda}{2}.
\end{align*}
We now pair this sup-norm control with the $\ell^1$ size of the error. The duality inequality between $\ell^\infty$ and $\ell^1$ on $\mathbb{R}^p$ states that for any $a,b \in \mathbb{R}^p$,
\begin{align*}
|a^\top b| \leq \|a\|_\infty\|b\|_1.
\end{align*}
Applying this inequality with $a=z$ and $b=\Delta$ gives
\begin{align*}
z^\top\Delta
\leq
|z^\top\Delta|
\leq
\|z\|_\infty\|\Delta\|_1
\leq
\frac{\lambda}{2}\|\Delta\|_1.
\end{align*}
Finally, we split the $\ell^1$ norm of $\Delta$ along the two disjoint coordinate sets $S$ and $S^c$. Since these supports are disjoint, the coordinate sums defining the $\ell^1$ norm separate:
\begin{align*}
\|\Delta\|_1
=
\|\Delta_S\|_1+\|\Delta_{S^c}\|_1.
\end{align*}
Substituting this decomposition into the score bound yields
\begin{align*}
\frac{1}{n}\varepsilon^\top X\Delta
\leq
\frac{\lambda}{2}\|\Delta_S\|_1
+
\frac{\lambda}{2}\|\Delta_{S^c}\|_1.
\end{align*}
[/guided]
[/step]
[step:Decompose the penalty difference using the support of $\beta^*$]
Because $S=\operatorname{supp}(\beta^*)$, we have $\beta^*_{S^c}=0$. Since $\hat{\beta}=\beta^*+\Delta$, the restrictions of $\hat{\beta}$ to $S$ and $S^c$ are
\begin{align*}
\hat{\beta}_S = \beta^*_S + \Delta_S,
\qquad
\hat{\beta}_{S^c} = \Delta_{S^c}.
\end{align*}
Using additivity of the $\ell^1$ norm over disjoint coordinate sets,
\begin{align*}
\|\beta^*\|_1 - \|\hat{\beta}\|_1
&=
\|\beta^*_S\|_1
-
\|\beta^*_S+\Delta_S\|_1
-
\|\Delta_{S^c}\|_1.
\end{align*}
The [reverse triangle inequality](/theorems/2300) gives
\begin{align*}
\|\beta^*_S\|_1 - \|\beta^*_S+\Delta_S\|_1
\leq
\|\Delta_S\|_1.
\end{align*}
Consequently,
\begin{align*}
\|\beta^*\|_1 - \|\hat{\beta}\|_1
\leq
\|\Delta_S\|_1 - \|\Delta_{S^c}\|_1.
\end{align*}
[guided]
The support condition is what creates the cone. Since $S=\operatorname{supp}(\beta^*)$, all coordinates of $\beta^*$ outside $S$ vanish, which means
\begin{align*}
\beta^*_{S^c}=0.
\end{align*}
Because $\Delta=\hat{\beta}-\beta^*$, we also have $\hat{\beta}=\beta^*+\Delta$. Restricting this identity to the two disjoint coordinate sets $S$ and $S^c$ gives
\begin{align*}
\hat{\beta}_S = \beta^*_S+\Delta_S,
\qquad
\hat{\beta}_{S^c} = \beta^*_{S^c}+\Delta_{S^c} = \Delta_{S^c}.
\end{align*}
The $\ell^1$ norm is additive over disjoint coordinate sets: if a vector is split into its coordinates on $S$ and on $S^c$, then its $\ell^1$ norm is the sum of the two corresponding $\ell^1$ norms. Therefore
\begin{align*}
\|\beta^*\|_1
=
\|\beta^*_S\|_1 + \|\beta^*_{S^c}\|_1
=
\|\beta^*_S\|_1,
\end{align*}
and
\begin{align*}
\|\hat{\beta}\|_1
=
\|\hat{\beta}_S\|_1+\|\hat{\beta}_{S^c}\|_1
=
\|\beta^*_S+\Delta_S\|_1+\|\Delta_{S^c}\|_1.
\end{align*}
Subtracting these identities gives
\begin{align*}
\|\beta^*\|_1 - \|\hat{\beta}\|_1
=
\|\beta^*_S\|_1
-
\|\beta^*_S+\Delta_S\|_1
-
\|\Delta_{S^c}\|_1.
\end{align*}
Now we bound the first two terms. The reverse triangle inequality for the $\ell^1$ norm says that for vectors $u,v \in \mathbb{R}^{|S|}$,
\begin{align*}
\bigl|\|u\|_1-\|v\|_1\bigr| \leq \|u-v\|_1.
\end{align*}
Apply this with $u=\beta^*_S$ and $v=\beta^*_S+\Delta_S$. Then
\begin{align*}
\|\beta^*_S\|_1-\|\beta^*_S+\Delta_S\|_1
\leq
\bigl|\|\beta^*_S\|_1-\|\beta^*_S+\Delta_S\|_1\bigr|
\leq
\|\Delta_S\|_1.
\end{align*}
Substituting this estimate into the penalty decomposition yields
\begin{align*}
\|\beta^*\|_1 - \|\hat{\beta}\|_1
\leq
\|\Delta_S\|_1-\|\Delta_{S^c}\|_1.
\end{align*}
The important feature is the sign: the coordinates outside the true support enter with a negative coefficient.
[/guided]
[/step]
[step:Combine the bounds and force the cone inequality]
Substituting the score estimate and the penalty estimate into the basic inequality gives, on $\mathcal{E}_\lambda$,
\begin{align*}
\frac{1}{2n}\|X\Delta\|_2^2
\leq
\frac{\lambda}{2}\|\Delta_S\|_1
+
\frac{\lambda}{2}\|\Delta_{S^c}\|_1
+
\lambda\|\Delta_S\|_1
-
\lambda\|\Delta_{S^c}\|_1.
\end{align*}
Combining like terms on the right-hand side gives
\begin{align*}
\frac{1}{2n}\|X\Delta\|_2^2
\leq
\frac{3\lambda}{2}\|\Delta_S\|_1
-
\frac{\lambda}{2}\|\Delta_{S^c}\|_1.
\end{align*}
The left-hand side is non-negative because it is a scalar multiple of a squared Euclidean norm. Hence
\begin{align*}
0
\leq
\frac{3\lambda}{2}\|\Delta_S\|_1
-
\frac{\lambda}{2}\|\Delta_{S^c}\|_1.
\end{align*}
Since $\lambda>0$, multiplying by the positive scalar
\begin{align*}
\frac{2}{\lambda}
\end{align*}
gives
\begin{align*}
\|\Delta_{S^c}\|_1 \leq 3\|\Delta_S\|_1.
\end{align*}
This is exactly the desired Lasso cone condition.
[guided]
We now insert the two estimates proved in the previous steps into the basic inequality. The basic inequality says
\begin{align*}
\frac{1}{2n}\|X\Delta\|_2^2
\leq
\frac{1}{n}\varepsilon^\top X\Delta
+
\lambda\bigl(\|\beta^*\|_1-\|\hat{\beta}\|_1\bigr).
\end{align*}
On $\mathcal{E}_\lambda$, the score estimate gives
\begin{align*}
\frac{1}{n}\varepsilon^\top X\Delta
\leq
\frac{\lambda}{2}\|\Delta_S\|_1
+
\frac{\lambda}{2}\|\Delta_{S^c}\|_1,
\end{align*}
and the penalty estimate gives
\begin{align*}
\|\beta^*\|_1-\|\hat{\beta}\|_1
\leq
\|\Delta_S\|_1-\|\Delta_{S^c}\|_1.
\end{align*}
Substituting both upper bounds into the right-hand side gives
\begin{align*}
\frac{1}{2n}\|X\Delta\|_2^2
\leq
\frac{\lambda}{2}\|\Delta_S\|_1
+
\frac{\lambda}{2}\|\Delta_{S^c}\|_1
+
\lambda\|\Delta_S\|_1
-
\lambda\|\Delta_{S^c}\|_1.
\end{align*}
Collecting the coefficients of $\|\Delta_S\|_1$ and $\|\Delta_{S^c}\|_1$ yields
\begin{align*}
\frac{1}{2n}\|X\Delta\|_2^2
\leq
\frac{3\lambda}{2}\|\Delta_S\|_1
-
\frac{\lambda}{2}\|\Delta_{S^c}\|_1.
\end{align*}
The left-hand side is non-negative because it is a non-negative scalar multiple of the squared Euclidean norm of $X\Delta$. Therefore the right-hand side must also be non-negative:
\begin{align*}
0
\leq
\frac{3\lambda}{2}\|\Delta_S\|_1
-
\frac{\lambda}{2}\|\Delta_{S^c}\|_1.
\end{align*}
Rearranging this inequality gives
\begin{align*}
\frac{\lambda}{2}\|\Delta_{S^c}\|_1
\leq
\frac{3\lambda}{2}\|\Delta_S\|_1.
\end{align*}
Since the theorem assumes $\lambda>0$, division by the positive scalar
\begin{align*}
\frac{\lambda}{2}
\end{align*}
preserves the inequality and gives
\begin{align*}
\|\Delta_{S^c}\|_1 \leq 3\|\Delta_S\|_1.
\end{align*}
This is the Lasso cone condition asserted in the theorem.
[/guided]
[/step]