[proofplan]
Set $\Delta := \hat{\beta}-\beta^*$ and use the [Lasso](/page/Lasso) optimality inequality at $\beta^*$ to compare the empirical prediction loss of $\hat{\beta}$ to the stochastic [inner product](/page/Inner%20Product) $\varepsilon^\top X\Delta$. On the event $\mathcal{E}_\lambda$, this stochastic term is controlled by $\lambda\|\Delta\|_1$, and the support decomposition over $S$ yields both the cone condition $\|\Delta_{S^c}\|_1 \le 3\|\Delta_S\|_1$ and the sharper prediction bound by $3\lambda\|\Delta_S\|_1$. If $s=0$, the same inequality forces $\Delta=0$; otherwise, the [compatibility constant](/page/Compatibility%20Condition) converts $\|\Delta_S\|_1$ into $|X\Delta|/\sqrt n$, after which a one-line algebraic rearrangement gives the stated fast-rate prediction bound.
[/proofplan]
custom_env
admin
[step:Derive the basic inequality from Lasso optimality]
Let $p \in \mathbb{N}$ denote the number of columns of $X$, so that $X \in \mathbb{R}^{n \times p}$ and $\beta^*,\hat{\beta} \in \mathbb{R}^p$. Define the error vector $\Delta \in \mathbb{R}^p$ by
\begin{align*}
\Delta := \hat{\beta}-\beta^*.
\end{align*}
Since $\hat{\beta}$ minimizes the Lasso objective over $\mathbb{R}^p$, comparison with $\beta^*$ gives
\begin{align*}
\frac{1}{n}|Y-X\hat{\beta}|^2 + 2\lambda\|\hat{\beta}\|_1
\le
\frac{1}{n}|Y-X\beta^*|^2 + 2\lambda\|\beta^*\|_1.
\end{align*}
Using $Y=X\beta^*+\varepsilon$ and $X\hat{\beta}=X\beta^*+X\Delta$, we have
\begin{align*}
Y-X\hat{\beta}=\varepsilon-X\Delta,
\qquad
Y-X\beta^*=\varepsilon.
\end{align*}
Substituting these identities and expanding the Euclidean square gives
\begin{align*}
\frac{1}{n}|\varepsilon-X\Delta|^2 + 2\lambda\|\hat{\beta}\|_1
\le
\frac{1}{n}|\varepsilon|^2 + 2\lambda\|\beta^*\|_1.
\end{align*}
After expanding $|\varepsilon-X\Delta|^2$, cancelling $n^{-1}|\varepsilon|^2$ from both sides, and moving the cross term to the right-hand side, we obtain
\begin{align*}
\frac{1}{n}|X\Delta|^2
\le
\frac{2}{n}\varepsilon^\top X\Delta
+2\lambda\bigl(\|\beta^*\|_1-\|\hat{\beta}\|_1\bigr).
\end{align*}
[/step]
custom_env
admin
[step:Use the event $\mathcal{E}_\lambda$ and the support decomposition]Recall that $\mathcal{E}_\lambda$ denotes the event
\begin{align*}
\mathcal{E}_\lambda
:=
\left\{\left\|\frac{1}{n}X^\top\varepsilon\right\|_\infty \le \frac{\lambda}{2}\right\}.
\end{align*}
On $\mathcal{E}_\lambda$, the [dual norm inequality](/page/Dual%20Norm) for $\ell^\infty$ and $\ell^1$ gives
\begin{align*}
\left|\frac{2}{n}\varepsilon^\top X\Delta\right|
=
2\left|\left(\frac{1}{n}X^\top\varepsilon\right)^\top \Delta\right|
\le
2\left\|\frac{1}{n}X^\top\varepsilon\right\|_\infty \|\Delta\|_1
\le
\lambda\|\Delta\|_1.
\end{align*}
Because $\beta^*_{S^c}=0$, we have $\hat{\beta}_{S^c}=\Delta_{S^c}$ and $\hat{\beta}_S=\beta^*_S+\Delta_S$. The [reverse triangle inequality](/page/Reverse%20Triangle%20Inequality) on the coordinates in $S$ gives
\begin{align*}
\|\beta^*\|_1-\|\hat{\beta}\|_1
=
\|\beta^*_S\|_1-\|\beta^*_S+\Delta_S\|_1-\|\Delta_{S^c}\|_1.
\end{align*}
Using $\|\beta^*_S\|_1-\|\beta^*_S+\Delta_S\|_1\le \|\Delta_S\|_1$, this becomes
\begin{align*}
\|\beta^*\|_1-\|\hat{\beta}\|_1
\le
\|\Delta_S\|_1-\|\Delta_{S^c}\|_1.
\end{align*}
Combining this estimate with the basic inequality gives
\begin{align*}
\frac{1}{n}|X\Delta|^2
\le
\lambda\|\Delta\|_1
+
2\lambda\bigl(\|\Delta_S\|_1-\|\Delta_{S^c}\|_1\bigr).
\end{align*}
Since $\|\Delta\|_1=\|\Delta_S\|_1+\|\Delta_{S^c}\|_1$, the right-hand side equals
\begin{align*}
3\lambda\|\Delta_S\|_1-\lambda\|\Delta_{S^c}\|_1.
\end{align*}
In particular,
\begin{align*}
\frac{1}{n}|X\Delta|^2
\le
3\lambda\|\Delta_S\|_1.
\end{align*}[/step]
custom_env
admin
[guided]The purpose of this step is to turn the random quantity involving $\varepsilon$ into a deterministic norm bound and then separate the active coordinates $S$ from the inactive coordinates $S^c$.
On the event $\mathcal{E}_\lambda$, the vector $\frac{1}{n}X^\top\varepsilon \in \mathbb{R}^p$ has $\ell^\infty$ norm at most $\lambda/2$. Applying the [dual norm inequality](/page/Dual%20Norm) between $\ell^\infty$ and $\ell^1$ to the vectors $\frac{1}{n}X^\top\varepsilon$ and $\Delta$ gives
\begin{align*}
\left|\frac{2}{n}\varepsilon^\top X\Delta\right|
=
2\left|\left(\frac{1}{n}X^\top\varepsilon\right)^\top \Delta\right|
\le
2\left\|\frac{1}{n}X^\top\varepsilon\right\|_\infty \|\Delta\|_1
\le
\lambda\|\Delta\|_1.
\end{align*}
This is exactly where the event $\mathcal{E}_\lambda$ is used.
Next we exploit sparsity. Since $S=\operatorname{supp}(\beta^*)$, the inactive coordinates satisfy $\beta^*_{S^c}=0$. Hence
\begin{align*}
\hat{\beta}_{S^c}
=
\beta^*_{S^c}+\Delta_{S^c}
=
\Delta_{S^c}.
\end{align*}
On the active coordinates, we also have
\begin{align*}
\hat{\beta}_S
=
\beta^*_S+\Delta_S.
\end{align*}
Therefore
\begin{align*}
\|\hat{\beta}\|_1
=
\|\beta^*_S+\Delta_S\|_1+\|\Delta_{S^c}\|_1.
\end{align*}
By the [reverse triangle inequality](/page/Reverse%20Triangle%20Inequality) on $\mathbb{R}^S$,
\begin{align*}
\|\beta^*_S\|_1-\|\beta^*_S+\Delta_S\|_1
\le
\|\Delta_S\|_1.
\end{align*}
Since $\|\beta^*\|_1=\|\beta^*_S\|_1$, it follows that
\begin{align*}
\|\beta^*\|_1-\|\hat{\beta}\|_1
=
\|\beta^*_S\|_1-\|\beta^*_S+\Delta_S\|_1-\|\Delta_{S^c}\|_1.
\end{align*}
Combining this identity with the [reverse triangle inequality](/theorems/2300) gives
\begin{align*}
\|\beta^*\|_1-\|\hat{\beta}\|_1
\le
\|\Delta_S\|_1-\|\Delta_{S^c}\|_1.
\end{align*}
Substituting the stochastic bound and this support estimate into the basic inequality yields
\begin{align*}
\frac{1}{n}|X\Delta|^2
\le
\lambda\|\Delta\|_1
+
2\lambda\bigl(\|\Delta_S\|_1-\|\Delta_{S^c}\|_1\bigr).
\end{align*}
Using the support decomposition $\|\Delta\|_1=\|\Delta_S\|_1+\|\Delta_{S^c}\|_1$, the right-hand side is
\begin{align*}
\lambda\bigl(\|\Delta_S\|_1+\|\Delta_{S^c}\|_1\bigr)
+
2\lambda\|\Delta_S\|_1
-
2\lambda\|\Delta_{S^c}\|_1.
\end{align*}
Collecting the active and inactive terms gives
\begin{align*}
3\lambda\|\Delta_S\|_1-\lambda\|\Delta_{S^c}\|_1.
\end{align*}
Dropping the non-positive term $-\lambda\|\Delta_{S^c}\|_1$ gives the sharper prediction estimate
\begin{align*}
\frac{1}{n}|X\Delta|^2
\le
3\lambda\|\Delta_S\|_1.
\end{align*}[/guided]
custom_env
admin
[step:Place the error vector in the compatibility cone]
From the inequality just obtained,
\begin{align*}
0
\le
\frac{1}{n}|X\Delta|^2
\le
3\lambda\|\Delta_S\|_1-\lambda\|\Delta_{S^c}\|_1.
\end{align*}
Since $\lambda>0$ by the Lasso tuning-parameter hypothesis, this implies
\begin{align*}
\|\Delta_{S^c}\|_1 \le 3\|\Delta_S\|_1.
\end{align*}
Thus $\Delta$ belongs to the cone on which the [compatibility constant](/page/Compatibility%20Condition) $\phi_0(S,3)$ is defined, unless $\|\Delta_S\|_1=0$. If $\|\Delta_S\|_1=0$, then the displayed inequality above gives $\|\Delta_{S^c}\|_1=0$, hence $\Delta=0$, and therefore $|X(\hat{\beta}-\beta^*)|=0$. This proves the asserted $s=0$ case and also handles the degenerate case $s\ge 1$ with $\|\Delta_S\|_1=0$. We may therefore assume $s\ge 1$ and $\|\Delta_S\|_1>0$.
[/step]
custom_env
admin
[step:Apply compatibility to convert the active $\ell^1$ error into prediction error]Since $\Delta \in \mathbb{R}^p$ satisfies
\begin{align*}
\|\Delta_{S^c}\|_1 \le 3\|\Delta_S\|_1
\qquad\text{and}\qquad
\|\Delta_S\|_1>0,
\end{align*}
the definition of $\phi_0(S,3)$ gives
\begin{align*}
\phi_0(S,3)^2
\le
\frac{s\,|X\Delta|^2}{n\|\Delta_S\|_1^2}.
\end{align*}
Equivalently,
\begin{align*}
\|\Delta_S\|_1
\le
\frac{\sqrt{s}\,|X\Delta|}{\sqrt{n}\,\phi_0(S,3)}.
\end{align*}
Combining this with the prediction estimate from the previous step gives
\begin{align*}
\frac{1}{n}|X\Delta|^2
\le
3\lambda
\frac{\sqrt{s}\,|X\Delta|}{\sqrt{n}\,\phi_0(S,3)}.
\end{align*}[/step]
custom_env
admin
[guided]The compatibility condition is designed precisely to convert a sparse $\ell^1$ quantity into a prediction norm quantity. We have already proved that $\Delta$ lies in the cone
\begin{align*}
\|\Delta_{S^c}\|_1 \le 3\|\Delta_S\|_1,
\end{align*}
and in this step we are in the non-degenerate case $s \ge 1$ and $\|\Delta_S\|_1>0$. Therefore the defining inequality for the [compatibility constant](/page/Compatibility%20Condition) $\phi_0(S,3)$ applies to the vector $\Delta \in \mathbb{R}^p$ and gives
\begin{align*}
\phi_0(S,3)^2
\le
\frac{s\,|X\Delta|^2}{n\|\Delta_S\|_1^2}.
\end{align*}
Multiplying by $\|\Delta_S\|_1^2$ and then taking square roots is valid because all quantities are non-negative and $\phi_0(S,3)>0$ by hypothesis. Hence
\begin{align*}
\|\Delta_S\|_1
\le
\frac{\sqrt{s}\,|X\Delta|}{\sqrt{n}\,\phi_0(S,3)}.
\end{align*}
The previous prediction estimate was
\begin{align*}
\frac{1}{n}|X\Delta|^2
\le
3\lambda\|\Delta_S\|_1.
\end{align*}
Substituting the compatibility bound for $\|\Delta_S\|_1$ into this estimate yields
\begin{align*}
\frac{1}{n}|X\Delta|^2
\le
3\lambda
\frac{\sqrt{s}\,|X\Delta|}{\sqrt{n}\,\phi_0(S,3)}.
\end{align*}
This is the key fast-rate inequality: the prediction loss now appears on both sides, so the final step is only algebra.[/guided]
custom_env
admin
[step:Divide by the prediction norm and square the resulting bound]If $|X\Delta|=0$, then
\begin{align*}
\frac{1}{n}|X\Delta|^2=0
\le
\frac{9\lambda^2 s}{\phi_0(S,3)^2},
\end{align*}
so the result holds. Assume now that $|X\Delta|>0$. Dividing the inequality
\begin{align*}
\frac{1}{n}|X\Delta|^2
\le
3\lambda
\frac{\sqrt{s}\,|X\Delta|}{\sqrt{n}\,\phi_0(S,3)}
\end{align*}
by the positive quantity $|X\Delta|/\sqrt n$ gives
\begin{align*}
\frac{|X\Delta|}{\sqrt n}
\le
\frac{3\lambda\sqrt{s}}{\phi_0(S,3)}.
\end{align*}
Squaring both sides yields
\begin{align*}
\frac{1}{n}|X\Delta|^2
\le
\frac{9\lambda^2 s}{\phi_0(S,3)^2}.
\end{align*}
Since $\Delta=\hat{\beta}-\beta^*$, this is precisely
\begin{align*}
\frac{1}{n}|X(\hat{\beta}-\beta^*)|^2
\le
\frac{9\lambda^2 s}{\phi_0(S,3)^2}.
\end{align*}[/step]
custom_env
admin
[guided]There are two cases. If $|X\Delta|=0$, then the prediction loss is already zero:
\begin{align*}
\frac{1}{n}|X\Delta|^2=0
\le
\frac{9\lambda^2 s}{\phi_0(S,3)^2},
\end{align*}
so the desired estimate holds.
Assume instead that $|X\Delta|>0$. From the compatibility step we have
\begin{align*}
\frac{1}{n}|X\Delta|^2
\le
3\lambda
\frac{\sqrt{s}\,|X\Delta|}{\sqrt{n}\,\phi_0(S,3)}.
\end{align*}
The factor $|X\Delta|/\sqrt n$ is positive in this case, so division by it preserves the inequality and gives
\begin{align*}
\frac{|X\Delta|}{\sqrt n}
\le
\frac{3\lambda\sqrt{s}}{\phi_0(S,3)}.
\end{align*}
Both sides are non-negative, so squaring preserves the inequality:
\begin{align*}
\frac{1}{n}|X\Delta|^2
\le
\frac{9\lambda^2 s}{\phi_0(S,3)^2}.
\end{align*}
Finally, substituting the definition $\Delta=\hat{\beta}-\beta^*$ gives
\begin{align*}
\frac{1}{n}|X(\hat{\beta}-\beta^*)|^2
\le
\frac{9\lambda^2 s}{\phi_0(S,3)^2}.
\end{align*}
This is exactly the theorem's claimed prediction bound.[/guided]