[proofplan]
Set $\Delta := \hat{\beta}-\beta^*$ and use the optimality of $\hat{\beta}$ in the Lasso objective to obtain the basic inequality. On the score event, the stochastic [inner product](/page/Inner%20Product) is controlled by $\frac{\lambda}{2}\|\Delta\|_1$, and the support decomposition of $\beta^*$ sharpens the inequality into a cone inequality and a prediction-error inequality. The cone condition permits the compatibility constant to control $\|\Delta_S\|_1$ by $|X\Delta|/\sqrt n$, and solving the resulting scalar quadratic inequality gives the desired fast prediction bound.
[/proofplan]
[step:Derive the sharpened basic inequality from Lasso optimality]
Define the estimation error vector $\Delta \in \mathbb{R}^p$ by
\begin{align*}
\Delta := \hat{\beta}-\beta^*.
\end{align*}
Throughout the proof, $|\cdot|$ denotes the Euclidean norm on the relevant finite-dimensional Euclidean space, and $\|\cdot\|_1$ and $\|\cdot\|_\infty$ denote the usual $\ell^1$ and $\ell^\infty$ norms on $\mathbb{R}^p$.
Since $\hat{\beta}$ minimizes the Lasso objective over $\mathbb{R}^p$, comparison with $\beta^*$ gives
\begin{align*}
\frac{1}{2n}|Y-X\hat{\beta}|^2+\lambda\|\hat{\beta}\|_1
\leq
\frac{1}{2n}|Y-X\beta^*|^2+\lambda\|\beta^*\|_1.
\end{align*}
Using $Y=X\beta^*+\varepsilon$ and $X\hat{\beta}=X\beta^*+X\Delta$, this becomes
\begin{align*}
\frac{1}{2n}|\varepsilon-X\Delta|^2+\lambda\|\beta^*+\Delta\|_1
\leq
\frac{1}{2n}|\varepsilon|^2+\lambda\|\beta^*\|_1.
\end{align*}
Expanding the squared Euclidean norm,
\begin{align*}
|\varepsilon-X\Delta|^2
=
|\varepsilon|^2-2\varepsilon^\top X\Delta+|X\Delta|^2,
\end{align*}
and cancelling $\frac{1}{2n}|\varepsilon|^2$ yields
\begin{align*}
\frac{1}{2n}|X\Delta|^2
\leq
\frac{1}{n}\varepsilon^\top X\Delta
+
\lambda\bigl(\|\beta^*\|_1-\|\beta^*+\Delta\|_1\bigr).
\end{align*}
On $\mathcal{E}_\lambda$, the finite-dimensional dual norm inequality for the pair $(\ell^\infty,\ell^1)$ gives
\begin{align*}
\frac{1}{n}\varepsilon^\top X\Delta
=
\left(\frac{1}{n}X^\top\varepsilon\right)^\top \Delta
\leq
\frac{1}{n}\|X^\top\varepsilon\|_\infty\|\Delta\|_1
\leq
\frac{\lambda}{2}\|\Delta\|_1.
\end{align*}
Since $\beta^*_{S^c}=0$, decomposing the $\ell^1$ norm over $S$ and $S^c$ gives
\begin{align*}
\|\beta^*\|_1-\|\beta^*+\Delta\|_1
=
\|\beta^*_S\|_1-\|\beta^*_S+\Delta_S\|_1-\|\Delta_{S^c}\|_1.
\end{align*}
The [reverse triangle inequality](/theorems/2300) in $\ell^1(S)$ gives $\|\beta^*_S\|_1-\|\beta^*_S+\Delta_S\|_1 \leq \|\Delta_S\|_1$, hence
\begin{align*}
\|\beta^*\|_1-\|\beta^*+\Delta\|_1
\leq
\|\Delta_S\|_1-\|\Delta_{S^c}\|_1.
\end{align*}
Combining these estimates and using $\|\Delta\|_1=\|\Delta_S\|_1+\|\Delta_{S^c}\|_1$, the combined right-hand side is
\begin{align*}
\frac{\lambda}{2}\bigl(\|\Delta_S\|_1+\|\Delta_{S^c}\|_1\bigr)
+
\lambda\bigl(\|\Delta_S\|_1-\|\Delta_{S^c}\|_1\bigr)
=
\frac{3\lambda}{2}\|\Delta_S\|_1-\frac{\lambda}{2}\|\Delta_{S^c}\|_1.
\end{align*}
Therefore
\begin{align*}
\frac{1}{2n}|X\Delta|^2
\leq
\frac{3\lambda}{2}\|\Delta_S\|_1-\frac{\lambda}{2}\|\Delta_{S^c}\|_1.
\end{align*}
Equivalently,
\begin{align*}
\frac{1}{2n}|X\Delta|^2+\frac{\lambda}{2}\|\Delta_{S^c}\|_1
\leq
\frac{3\lambda}{2}\|\Delta_S\|_1.
\end{align*}
[guided]
The proof starts from the only property of $\hat{\beta}$ that is needed: it has objective value no larger than the value at the true coefficient vector $\beta^*$. Define the error vector $\Delta \in \mathbb{R}^p$ by
\begin{align*}
\Delta := \hat{\beta}-\beta^*.
\end{align*}
In this proof, $|\cdot|$ denotes the Euclidean norm on the relevant finite-dimensional Euclidean space, while $\|\cdot\|_1$ and $\|\cdot\|_\infty$ denote the usual $\ell^1$ and $\ell^\infty$ norms on $\mathbb{R}^p$. Thus $|Y-X\hat{\beta}|$ and $|X\Delta|$ are Euclidean prediction norms, whereas $\|\hat{\beta}\|_1$ is the Lasso penalty norm.
The Lasso optimality condition gives
\begin{align*}
\frac{1}{2n}|Y-X\hat{\beta}|^2+\lambda\|\hat{\beta}\|_1
\leq
\frac{1}{2n}|Y-X\beta^*|^2+\lambda\|\beta^*\|_1.
\end{align*}
Because the model is $Y=X\beta^*+\varepsilon$, the two residuals are
\begin{align*}
Y-X\hat{\beta}
=
\varepsilon-X\Delta,
\qquad
Y-X\beta^*
=
\varepsilon.
\end{align*}
Substituting these identities into the optimality inequality gives
\begin{align*}
\frac{1}{2n}|\varepsilon-X\Delta|^2+\lambda\|\beta^*+\Delta\|_1
\leq
\frac{1}{2n}|\varepsilon|^2+\lambda\|\beta^*\|_1.
\end{align*}
Now expand the squared Euclidean norm:
\begin{align*}
|\varepsilon-X\Delta|^2
=
|\varepsilon|^2-2\varepsilon^\top X\Delta+|X\Delta|^2.
\end{align*}
After cancelling $\frac{1}{2n}|\varepsilon|^2$ from both sides, we obtain
\begin{align*}
\frac{1}{2n}|X\Delta|^2
\leq
\frac{1}{n}\varepsilon^\top X\Delta
+
\lambda\bigl(\|\beta^*\|_1-\|\beta^*+\Delta\|_1\bigr).
\end{align*}
The score event is designed exactly to control the stochastic inner product. The finite-dimensional [dual norm](/page/Dual%20Norm) inequality says that for vectors $a,b \in \mathbb{R}^p$, $a^\top b \leq \|a\|_\infty\|b\|_1$. Applying this with $a=\frac{1}{n}X^\top\varepsilon$ and $b=\Delta$ gives
\begin{align*}
\frac{1}{n}\varepsilon^\top X\Delta
=
\left(\frac{1}{n}X^\top\varepsilon\right)^\top\Delta
\leq
\frac{1}{n}\|X^\top\varepsilon\|_\infty\|\Delta\|_1.
\end{align*}
On $\mathcal{E}_\lambda$, the factor $\frac{1}{n}\|X^\top\varepsilon\|_\infty$ is at most $\lambda/2$, hence
\begin{align*}
\frac{1}{n}\varepsilon^\top X\Delta
\leq
\frac{\lambda}{2}\|\Delta\|_1.
\end{align*}
It remains to exploit the sparsity of $\beta^*$. Since $S=\operatorname{supp}(\beta^*)$, we have $\beta^*_{S^c}=0$. Therefore
\begin{align*}
\|\beta^*+\Delta\|_1
=
\|\beta^*_S+\Delta_S\|_1+\|\Delta_{S^c}\|_1.
\end{align*}
The [reverse triangle inequality](/page/Triangle%20Inequality) in $\ell^1(S)$ gives
\begin{align*}
\|\beta^*_S+\Delta_S\|_1
\geq
\|\beta^*_S\|_1-\|\Delta_S\|_1.
\end{align*}
Therefore
\begin{align*}
\|\beta^*_S\|_1-\|\beta^*_S+\Delta_S\|_1
\leq
\|\Delta_S\|_1.
\end{align*}
Combining this with the support decomposition gives
\begin{align*}
\|\beta^*\|_1-\|\beta^*+\Delta\|_1
\leq
\|\Delta_S\|_1-\|\Delta_{S^c}\|_1.
\end{align*}
Substituting the stochastic bound and this support decomposition into the basic inequality gives the next estimate. First, using $\|\Delta\|_1=\|\Delta_S\|_1+\|\Delta_{S^c}\|_1$, the right-hand side becomes
\begin{align*}
\frac{\lambda}{2}\|\Delta\|_1
+
\lambda\bigl(\|\Delta_S\|_1-\|\Delta_{S^c}\|_1\bigr)
=
\frac{\lambda}{2}\bigl(\|\Delta_S\|_1+\|\Delta_{S^c}\|_1\bigr)
+
\lambda\bigl(\|\Delta_S\|_1-\|\Delta_{S^c}\|_1\bigr).
\end{align*}
Collecting the coefficients of $\|\Delta_S\|_1$ and $\|\Delta_{S^c}\|_1$ gives
\begin{align*}
\frac{\lambda}{2}\bigl(\|\Delta_S\|_1+\|\Delta_{S^c}\|_1\bigr)
+
\lambda\bigl(\|\Delta_S\|_1-\|\Delta_{S^c}\|_1\bigr)
=
\frac{3\lambda}{2}\|\Delta_S\|_1-\frac{\lambda}{2}\|\Delta_{S^c}\|_1.
\end{align*}
Hence
\begin{align*}
\frac{1}{2n}|X\Delta|^2
\leq
\frac{3\lambda}{2}\|\Delta_S\|_1-\frac{\lambda}{2}\|\Delta_{S^c}\|_1.
\end{align*}
Moving the non-positive term on the right to the left gives the sharpened basic inequality
\begin{align*}
\frac{1}{2n}|X\Delta|^2+\frac{\lambda}{2}\|\Delta_{S^c}\|_1
\leq
\frac{3\lambda}{2}\|\Delta_S\|_1.
\end{align*}
This inequality is the point where the noise control and the sparsity of $\beta^*$ meet.
[/guided]
[/step]
[step:Use the sharpened inequality to put the error in the compatibility cone]
From
\begin{align*}
\frac{1}{2n}|X\Delta|^2+\frac{\lambda}{2}\|\Delta_{S^c}\|_1
\leq
\frac{3\lambda}{2}\|\Delta_S\|_1
\end{align*}
and the non-negativity of $\frac{1}{2n}|X\Delta|^2$, we obtain
\begin{align*}
\|\Delta_{S^c}\|_1
\leq
3\|\Delta_S\|_1.
\end{align*}
Thus $\Delta$ belongs to the cone appearing in the definition of the [compatibility constant](/page/Compatibility%20Constant) $\phi_{\mathrm{comp}}(S,3)$ whenever $\Delta_S \neq 0$. If $\Delta_S=0$, the same inequality gives $\|\Delta_{S^c}\|_1=0$, hence $\Delta=0$, and the desired prediction bound follows immediately. Therefore, for the remaining argument, assume $\Delta_S\neq 0$.
[guided]
Starting from the sharpened basic inequality,
\begin{align*}
\frac{1}{2n}|X\Delta|^2+\frac{\lambda}{2}\|\Delta_{S^c}\|_1
\leq
\frac{3\lambda}{2}\|\Delta_S\|_1,
\end{align*}
we isolate the part that forces the Lasso error into the compatibility cone. The term $\frac{1}{2n}|X\Delta|^2$ is non-negative because it is a scalar multiple of a squared Euclidean norm. Dropping this non-negative term from the left-hand side gives the weaker inequality
\begin{align*}
\frac{\lambda}{2}\|\Delta_{S^c}\|_1
\leq
\frac{3\lambda}{2}\|\Delta_S\|_1.
\end{align*}
Since $\lambda>0$ by hypothesis, division by $\lambda/2$ gives
\begin{align*}
\|\Delta_{S^c}\|_1
\leq
3\|\Delta_S\|_1.
\end{align*}
This is exactly the cone condition appearing in the definition of the [compatibility constant](/page/Compatibility%20Constant) $\phi_{\mathrm{comp}}(S,3)$, provided $\Delta_S\neq 0$.
There is one degenerate case to remove before applying compatibility. If $\Delta_S=0$, then the cone inequality becomes
\begin{align*}
\|\Delta_{S^c}\|_1
\leq
0.
\end{align*}
Because an $\ell^1$ norm is non-negative, this implies $\|\Delta_{S^c}\|_1=0$, hence $\Delta_{S^c}=0$. Together with $\Delta_S=0$, this gives $\Delta=0$. Therefore $X(\hat\beta-\beta^*)=X\Delta=0$, and the theorem's prediction bound follows because the right-hand side is non-negative. Thus, in the remaining case, we may assume $\Delta_S\neq 0$ and $\Delta$ lies in the compatibility cone.
[/guided]
[/step]
[step:Apply the compatibility constant to control the active coordinates]
Since $\Delta_S\neq 0$ and $\|\Delta_{S^c}\|_1\leq 3\|\Delta_S\|_1$, the definition of the [compatibility constant](/page/Compatibility%20Constant) $\phi_{\mathrm{comp}}(S,3)$ gives
\begin{align*}
\phi_{\mathrm{comp}}(S,3)
\leq
\frac{\sqrt{s}|X\Delta|}{\sqrt{n}\|\Delta_S\|_1}.
\end{align*}
Because $\phi_{\mathrm{comp}}(S,3)>0$, rearranging yields
\begin{align*}
\|\Delta_S\|_1
\leq
\frac{\sqrt{s}|X\Delta|}{\sqrt{n}\phi_{\mathrm{comp}}(S,3)}.
\end{align*}
Discarding the non-negative term $\frac{\lambda}{2}\|\Delta_{S^c}\|_1$ from the sharpened basic inequality gives
\begin{align*}
\frac{1}{2n}|X\Delta|^2
\leq
\frac{3\lambda}{2}\|\Delta_S\|_1.
\end{align*}
Substituting the compatibility bound gives
\begin{align*}
\frac{1}{2n}|X\Delta|^2
\leq
\frac{3\lambda}{2}
\frac{\sqrt{s}|X\Delta|}{\sqrt{n}\phi_{\mathrm{comp}}(S,3)}.
\end{align*}
[guided]
Now the purpose of the cone condition becomes visible: it allows the [compatibility constant](/page/Compatibility%20Constant) to convert the active-coordinate quantity $\|\Delta_S\|_1$ into the prediction norm $|X\Delta|/\sqrt n$. We have already verified the two hypotheses needed to use the definition: first, $\Delta_S\neq 0$; second,
\begin{align*}
\|\Delta_{S^c}\|_1
\leq
3\|\Delta_S\|_1.
\end{align*}
Therefore the definition of $\phi_{\mathrm{comp}}(S,3)$ gives
\begin{align*}
\phi_{\mathrm{comp}}(S,3)
\leq
\frac{\sqrt{s}|X\Delta|}{\sqrt{n}\|\Delta_S\|_1}.
\end{align*}
The theorem assumes $\phi_{\mathrm{comp}}(S,3)>0$, and $\|\Delta_S\|_1>0$ follows from $\Delta_S\neq 0$. Hence we may multiply by $\|\Delta_S\|_1$ and divide by $\phi_{\mathrm{comp}}(S,3)$ to obtain
\begin{align*}
\|\Delta_S\|_1
\leq
\frac{\sqrt{s}|X\Delta|}{\sqrt{n}\phi_{\mathrm{comp}}(S,3)}.
\end{align*}
Return to the sharpened basic inequality,
\begin{align*}
\frac{1}{2n}|X\Delta|^2+\frac{\lambda}{2}\|\Delta_{S^c}\|_1
\leq
\frac{3\lambda}{2}\|\Delta_S\|_1.
\end{align*}
Since $\frac{\lambda}{2}\|\Delta_{S^c}\|_1\geq 0$, discarding it from the left-hand side gives the weaker estimate
\begin{align*}
\frac{1}{2n}|X\Delta|^2
\leq
\frac{3\lambda}{2}\|\Delta_S\|_1.
\end{align*}
Substituting the compatibility bound for $\|\Delta_S\|_1$ yields
\begin{align*}
\frac{1}{2n}|X\Delta|^2
\leq
\frac{3\lambda}{2}
\frac{\sqrt{s}|X\Delta|}{\sqrt{n}\phi_{\mathrm{comp}}(S,3)}.
\end{align*}
This reduces the proof to a scalar inequality for the normalized prediction error.
[/guided]
[/step]
[step:Solve the scalar inequality and square the prediction norm]
Define the scalar $u \in [0,\infty)$ by
\begin{align*}
u := \frac{|X\Delta|}{\sqrt{n}}.
\end{align*}
The preceding inequality becomes
\begin{align*}
\frac{u^2}{2}
\leq
\frac{3\lambda\sqrt{s}}{2\phi_{\mathrm{comp}}(S,3)}u.
\end{align*}
If $u=0$, the desired bound is immediate. If $u>0$, divide both sides by $u/2$ to obtain
\begin{align*}
u
\leq
\frac{3\lambda\sqrt{s}}{\phi_{\mathrm{comp}}(S,3)}.
\end{align*}
Squaring both sides and using $\Delta=\hat{\beta}-\beta^*$ gives
\begin{align*}
\frac{1}{n}|X(\hat{\beta}-\beta^*)|^2
=
u^2
\leq
\frac{9\lambda^2s}{\phi_{\mathrm{comp}}(S,3)^2}.
\end{align*}
This is the claimed fast prediction bound.
[guided]
Define the normalized prediction-error scalar $u \in [0,\infty)$ by
\begin{align*}
u := \frac{|X\Delta|}{\sqrt{n}}.
\end{align*}
The previous step proved
\begin{align*}
\frac{1}{2n}|X\Delta|^2
\leq
\frac{3\lambda}{2}
\frac{\sqrt{s}|X\Delta|}{\sqrt{n}\phi_{\mathrm{comp}}(S,3)}.
\end{align*}
Using the definition of $u$, this is exactly
\begin{align*}
\frac{u^2}{2}
\leq
\frac{3\lambda\sqrt{s}}{2\phi_{\mathrm{comp}}(S,3)}u.
\end{align*}
If $u=0$, then
\begin{align*}
\frac{1}{n}|X(\hat\beta-\beta^*)|^2
=
u^2
=
0,
\end{align*}
and the desired inequality follows because $9\lambda^2s/\phi_{\mathrm{comp}}(S,3)^2\geq 0$.
Assume instead that $u>0$. Then $u/2>0$, so we may divide both sides of the scalar inequality by $u/2$ and obtain
\begin{align*}
u
\leq
\frac{3\lambda\sqrt{s}}{\phi_{\mathrm{comp}}(S,3)}.
\end{align*}
Both sides are non-negative, so squaring preserves the inequality:
\begin{align*}
u^2
\leq
\frac{9\lambda^2s}{\phi_{\mathrm{comp}}(S,3)^2}.
\end{align*}
Finally, since $\Delta=\hat\beta-\beta^*$, the definition of $u$ gives
\begin{align*}
\frac{1}{n}|X(\hat\beta-\beta^*)|^2
=
\frac{1}{n}|X\Delta|^2
=
u^2
\leq
\frac{9\lambda^2s}{\phi_{\mathrm{comp}}(S,3)^2}.
\end{align*}
This proves the theorem in guided mode as well as in exact mode.
[/guided]
[/step]