[proofplan]
We compare the Lasso solution $\hat{\beta}$ with an arbitrary sparse vector $\beta$, rather than only with the true parameter $\beta^*$. The optimality inequality gives a deterministic bound involving the stochastic score term and the penalty difference. On the event $\mathcal{E}_\lambda$, the score term is controlled by the $\ell^1$ distance, and the penalty decomposition either already gives the desired prediction bound or places the error vector in the compatibility cone. In the cone case, the compatibility condition converts the active-coordinate $\ell^1$ error into prediction error, and elementary quadratic inequalities absorb the remaining prediction term.
[/proofplan]
[step:Derive the basic inequality against an arbitrary sparse comparator]
Fix $\beta \in \mathbb{R}^p$ with $\|\beta\|_0 \leq m$. Define $d \in \mathbb{R}^p$ by
\begin{align*}
d := \hat{\beta}-\beta.
\end{align*}
Define $S \subseteq \{1,\dots,p\}$ by
\begin{align*}
S := \{j \in \{1,\dots,p\}: \beta_j \neq 0\}.
\end{align*}
Define $s \in \mathbb{N} \cup \{0\}$ by
\begin{align*}
s := |S|.
\end{align*}
Thus $s=\|\beta\|_0 \leq m$.
Since $\hat{\beta}$ minimizes the Lasso objective, evaluating the objective at $\hat{\beta}$ and at $\beta$ gives
\begin{align*}
\frac{1}{n}|Y-X\hat{\beta}|^2 + 2\lambda\|\hat{\beta}\|_1
\leq
\frac{1}{n}|Y-X\beta|^2 + 2\lambda\|\beta\|_1.
\end{align*}
Using $Y=X\beta^*+\varepsilon$, expanding both squared norms, and cancelling $\frac{1}{n}|\varepsilon|^2$, we obtain
\begin{align*}
\frac{1}{n}|X(\hat{\beta}-\beta^*)|^2
\leq
\frac{1}{n}|X(\beta-\beta^*)|^2
+
\frac{2}{n}\varepsilon^\top X(\hat{\beta}-\beta)
+
2\lambda(\|\beta\|_1-\|\hat{\beta}\|_1).
\end{align*}
On $\mathcal{E}_\lambda$, the dual norm inequality gives
\begin{align*}
\left|\frac{2}{n}\varepsilon^\top X(\hat{\beta}-\beta)\right|
=
2\left|\left(\frac{1}{n}X^\top\varepsilon\right)^\top d\right|
\leq
2\left\|\frac{1}{n}X^\top\varepsilon\right\|_\infty \|d\|_1
\leq
\lambda\|d\|_1.
\end{align*}
Therefore
\begin{align*}
\frac{1}{n}|X(\hat{\beta}-\beta^*)|^2
\leq
\frac{1}{n}|X(\beta-\beta^*)|^2
+
\lambda\|d\|_1
+
2\lambda(\|\beta\|_1-\|\hat{\beta}\|_1).
\end{align*}
[guided]
We begin by comparing $\hat{\beta}$ with a completely arbitrary sparse vector $\beta$. This is the key difference between an estimation bound around $\beta^*$ and an oracle inequality: the vector $\beta$ is allowed to approximate $\beta^*$, so the final estimate will contain the approximation term
\begin{align*}
\frac{1}{n}|X(\beta-\beta^*)|^2.
\end{align*}
Define $d \in \mathbb{R}^p$ by
\begin{align*}
d := \hat{\beta}-\beta.
\end{align*}
Define $S \subseteq \{1,\dots,p\}$ by
\begin{align*}
S := \{j \in \{1,\dots,p\}: \beta_j \neq 0\}.
\end{align*}
Define $s \in \mathbb{N} \cup \{0\}$ by
\begin{align*}
s := |S|.
\end{align*}
Since $S$ is the support of $\beta$, we have $s=\|\beta\|_0 \leq m$.
The theorem statement defines a Lasso solution as a minimizer of the map $Q: \mathbb{R}^p \to \mathbb{R}$ defined by
\begin{align*}
Q(b) := \frac{1}{n}|Y-Xb|^2 + 2\lambda\|b\|_1.
\end{align*}
Because $\hat{\beta}$ is a minimizer of $Q$, we have $Q(\hat{\beta}) \leq Q(\beta)$, namely
\begin{align*}
\frac{1}{n}|Y-X\hat{\beta}|^2 + 2\lambda\|\hat{\beta}\|_1
\leq
\frac{1}{n}|Y-X\beta|^2 + 2\lambda\|\beta\|_1.
\end{align*}
Substitute $Y=X\beta^*+\varepsilon$. Then
\begin{align*}
Y-X\hat{\beta}
=
\varepsilon-X(\hat{\beta}-\beta^*),
\qquad
Y-X\beta
=
\varepsilon-X(\beta-\beta^*).
\end{align*}
Expanding the Euclidean squared norm with $h=\hat{\beta}-\beta^*$ gives
\begin{align*}
\frac{1}{n}|\varepsilon-Xh|^2
=
\frac{1}{n}|\varepsilon|^2
-
\frac{2}{n}\varepsilon^\top Xh
+
\frac{1}{n}|Xh|^2.
\end{align*}
Applying the same expansion with $h=\beta-\beta^*$ gives
\begin{align*}
\frac{1}{n}|\varepsilon-Xh|^2
=
\frac{1}{n}|\varepsilon|^2
-
\frac{2}{n}\varepsilon^\top Xh
+
\frac{1}{n}|Xh|^2.
\end{align*}
After cancelling the common term $\frac{1}{n}|\varepsilon|^2$, substituting $h=\hat{\beta}-\beta^*$ and $h=\beta-\beta^*$ back into the two displayed expansions, and moving the stochastic terms to the right-hand side, we obtain
\begin{align*}
\frac{1}{n}|X(\hat{\beta}-\beta^*)|^2
\leq
\frac{1}{n}|X(\beta-\beta^*)|^2
+
\frac{2}{n}\varepsilon^\top X(\hat{\beta}-\beta)
+
2\lambda(\|\beta\|_1-\|\hat{\beta}\|_1).
\end{align*}
Now we use the event $\mathcal{E}_\lambda$ exactly where it is needed: it controls the random linear score. Since $d=\hat{\beta}-\beta$, we have
\begin{align*}
\frac{2}{n}\varepsilon^\top X(\hat{\beta}-\beta)
=
2\left(\frac{1}{n}X^\top\varepsilon\right)^\top d.
\end{align*}
The [dual norm inequality](/page/Dual%20Norm) between $\ell^\infty$ and $\ell^1$ gives
\begin{align*}
\left|\frac{2}{n}\varepsilon^\top X(\hat{\beta}-\beta)\right|
\leq
2\left\|\frac{1}{n}X^\top\varepsilon\right\|_\infty\|d\|_1.
\end{align*}
On $\mathcal{E}_\lambda$, the factor $\left\|\frac{1}{n}X^\top\varepsilon\right\|_\infty$ is at most $\lambda/2$, hence
\begin{align*}
\left|\frac{2}{n}\varepsilon^\top X(\hat{\beta}-\beta)\right|
\leq
\lambda\|d\|_1.
\end{align*}
Substituting this into the basic inequality yields
\begin{align*}
\frac{1}{n}|X(\hat{\beta}-\beta^*)|^2
\leq
\frac{1}{n}|X(\beta-\beta^*)|^2
+
\lambda\|d\|_1
+
2\lambda(\|\beta\|_1-\|\hat{\beta}\|_1).
\end{align*}
[/guided]
[/step]
[step:Decompose the penalty into active and inactive coordinates]
Because $\beta_{S^c}=0$, we have $d_{S^c}=\hat{\beta}_{S^c}$. Also,
\begin{align*}
\|\hat{\beta}\|_1
=
\|\beta_S+d_S\|_1+\|d_{S^c}\|_1.
\end{align*}
The [reverse triangle inequality](/page/Triangle%20Inequality) for the $\ell^1$ norm gives
\begin{align*}
\|\beta\|_1-\|\hat{\beta}\|_1
=
\|\beta_S\|_1-\|\beta_S+d_S\|_1-\|d_{S^c}\|_1
\leq
\|d_S\|_1-\|d_{S^c}\|_1.
\end{align*}
Since $\|d\|_1=\|d_S\|_1+\|d_{S^c}\|_1$, the previous step gives
\begin{align*}
\frac{1}{n}|X(\hat{\beta}-\beta^*)|^2
\leq
\frac{1}{n}|X(\beta-\beta^*)|^2
+
3\lambda\|d_S\|_1
-
\lambda\|d_{S^c}\|_1.
\end{align*}
[guided]
The penalty term must be separated according to the support of the comparator $\beta$, because the compatibility condition only controls the coordinates in that support. Since $S=\{j:\beta_j\neq 0\}$, the restriction of $\beta$ to $S^c$ is zero, and therefore $d_{S^c}=\hat{\beta}_{S^c}$. Hence
\begin{align*}
\|\hat{\beta}\|_1
=
\|\beta_S+d_S\|_1+
\|d_{S^c}\|_1.
\end{align*}
The [reverse triangle inequality](/page/Triangle%20Inequality) for the $\ell^1$ norm gives
\begin{align*}
\|\beta_S\|_1-\|\beta_S+d_S\|_1
\leq
\|d_S\|_1.
\end{align*}
Using also $\|\beta\|_1=\|\beta_S\|_1$, we obtain
\begin{align*}
\|\beta\|_1-\|\hat{\beta}\|_1
=
\|\beta_S\|_1-\|\beta_S+d_S\|_1-\|d_{S^c}\|_1
\leq
\|d_S\|_1-\|d_{S^c}\|_1.
\end{align*}
The earlier basic inequality contains the two penalty contributions $\lambda\|d\|_1$ and $2\lambda(\|\beta\|_1-\|\hat{\beta}\|_1)$. Since $\|d\|_1=\|d_S\|_1+\|d_{S^c}\|_1$, substitution first gives
\begin{align*}
\lambda\|d\|_1+2\lambda(\|\beta\|_1-\|\hat{\beta}\|_1)
\leq
\lambda(\|d_S\|_1+\|d_{S^c}\|_1)+2\lambda(\|d_S\|_1-\|d_{S^c}\|_1).
\end{align*}
Collecting the active and inactive coordinate terms gives
\begin{align*}
\lambda(\|d_S\|_1+\|d_{S^c}\|_1)+2\lambda(\|d_S\|_1-\|d_{S^c}\|_1)
=
3\lambda\|d_S\|_1-\lambda\|d_{S^c}\|_1.
\end{align*}
Therefore
\begin{align*}
\frac{1}{n}|X(\hat{\beta}-\beta^*)|^2
\leq
\frac{1}{n}|X(\beta-\beta^*)|^2
+
3\lambda\|d_S\|_1
-
\lambda\|d_{S^c}\|_1.
\end{align*}
This is the point where sparsity enters: the active coordinates produce a positive error term, while the inactive coordinates produce a negative term that can either settle the proof immediately or force a cone condition.
[/guided]
[/step]
[step:Split according to whether the error lies in the compatibility cone]
If
\begin{align*}
\|d_{S^c}\|_1 > 3\|d_S\|_1,
\end{align*}
then
\begin{align*}
3\lambda\|d_S\|_1-\lambda\|d_{S^c}\|_1 < 0,
\end{align*}
and hence
\begin{align*}
\frac{1}{n}|X(\hat{\beta}-\beta^*)|^2
\leq
\frac{1}{n}|X(\beta-\beta^*)|^2.
\end{align*}
This is already stronger than the desired estimate.
It remains to consider the case
\begin{align*}
\|d_{S^c}\|_1 \leq 3\|d_S\|_1.
\end{align*}
If $s=0$, then $S=\varnothing$, so $d_S=0$ and the cone condition forces $d_{S^c}=0$, hence $d=0$ and the desired estimate follows. Assume therefore that $1 \leq s \leq m$.
[guided]
From the penalty decomposition we have
\begin{align*}
\frac{1}{n}|X(\hat{\beta}-\beta^*)|^2
\leq
\frac{1}{n}|X(\beta-\beta^*)|^2
+
3\lambda\|d_S\|_1
-
\lambda\|d_{S^c}\|_1.
\end{align*}
There are two possibilities. If
\begin{align*}
\|d_{S^c}\|_1 > 3\|d_S\|_1,
\end{align*}
then the penalty remainder is strictly negative:
\begin{align*}
3\lambda\|d_S\|_1-\lambda\|d_{S^c}\|_1 < 0.
\end{align*}
Consequently
\begin{align*}
\frac{1}{n}|X(\hat{\beta}-\beta^*)|^2
\leq
\frac{1}{n}|X(\beta-\beta^*)|^2,
\end{align*}
which is stronger than the desired oracle bound because the additional term $\lambda^2\|\beta\|_0/\phi^2$ is nonnegative.
The only remaining case is therefore
\begin{align*}
\|d_{S^c}\|_1 \leq 3\|d_S\|_1.
\end{align*}
This is exactly the compatibility cone. If $s=0$, then $S=\varnothing$, so $d_S=0$. The cone inequality then gives $\|d_{S^c}\|_1\leq 0$, hence $d_{S^c}=0$ and $d=0$. Thus the proof is complete in the zero-support case. We may therefore assume $1\leq s\leq m$, which is precisely the range in which the uniform compatibility hypothesis can be applied to $S$.
[/guided]
[/step]
[step:Use compatibility to convert active \ell^1 error into prediction error]
Since $1 \leq s \leq m$ and $d$ satisfies the cone condition
\begin{align*}
\|d_{S^c}\|_1 \leq 3\|d_S\|_1,
\end{align*}
we are in the cone convention used in the [compatibility condition](/page/Compatibility%20Condition), namely cone constant $3$. Applying that condition to the set $S$ and the vector $d$ gives
\begin{align*}
\frac{1}{n}|Xd|^2 \geq \frac{\phi^2}{s}\|d_S\|_1^2.
\end{align*}
Therefore
\begin{align*}
\|d_S\|_1
\leq
\frac{\sqrt{s}}{\phi}\frac{|Xd|}{\sqrt{n}}.
\end{align*}
Substituting this into the penalty bound and dropping the nonpositive term $-\lambda\|d_{S^c}\|_1$ gives
\begin{align*}
\frac{1}{n}|X(\hat{\beta}-\beta^*)|^2
\leq
\frac{1}{n}|X(\beta-\beta^*)|^2
+
\frac{3\lambda\sqrt{s}}{\phi}\frac{|Xd|}{\sqrt{n}}.
\end{align*}
[guided]
In the remaining case the error vector $d$ lies in the cone
\begin{align*}
\|d_{S^c}\|_1 \leq 3\|d_S\|_1.
\end{align*}
The set $S$ has size $s$ with $1\leq s\leq m$, so it is one of the supports covered by the uniform compatibility assumption. We use the cone-constant-$3$ convention in the [compatibility condition](/page/Compatibility%20Condition): it applies to vectors satisfying $\|d_{S^c}\|_1\leq 3\|d_S\|_1$. Since this is exactly the cone inequality above, applying the compatibility condition to this support $S$ and this vector $d$ gives
\begin{align*}
\frac{1}{n}|Xd|^2 \geq \frac{\phi^2}{s}\|d_S\|_1^2.
\end{align*}
Taking square roots is valid because both sides are nonnegative, and it gives
\begin{align*}
\|d_S\|_1
\leq
\frac{\sqrt{s}}{\phi}\frac{|Xd|}{\sqrt{n}}.
\end{align*}
We substitute this estimate into the penalty-decomposed inequality
\begin{align*}
\frac{1}{n}|X(\hat{\beta}-\beta^*)|^2
\leq
\frac{1}{n}|X(\beta-\beta^*)|^2
+
3\lambda\|d_S\|_1
-
\lambda\|d_{S^c}\|_1.
\end{align*}
The final term $-\lambda\|d_{S^c}\|_1$ is nonpositive because $\lambda\geq 0$ and $\|d_{S^c}\|_1\geq 0$, so dropping it only weakens the inequality. Therefore
\begin{align*}
\frac{1}{n}|X(\hat{\beta}-\beta^*)|^2
\leq
\frac{1}{n}|X(\beta-\beta^*)|^2
+
\frac{3\lambda\sqrt{s}}{\phi}\frac{|Xd|}{\sqrt{n}}.
\end{align*}
This is the bridge from sparse geometry to prediction error: the active-coordinate $\ell^1$ term has been replaced by the prediction norm of $Xd$.
[/guided]
[/step]
[step:Absorb the remaining prediction term]
Define $A \in [0,\infty)$ by
\begin{align*}
A := \frac{1}{n}|X(\hat{\beta}-\beta^*)|^2.
\end{align*}
Define $B \in [0,\infty)$ by
\begin{align*}
B := \frac{1}{n}|X(\beta-\beta^*)|^2.
\end{align*}
Define $a \in [0,\infty)$ by
\begin{align*}
a := \frac{3\lambda\sqrt{s}}{\phi}.
\end{align*}
Since
\begin{align*}
Xd = X(\hat{\beta}-\beta)=X(\hat{\beta}-\beta^*)-X(\beta-\beta^*),
\end{align*}
the [Euclidean triangle inequality](/page/Triangle%20Inequality) gives
\begin{align*}
\frac{|Xd|}{\sqrt{n}} \leq \sqrt{A}+\sqrt{B}.
\end{align*}
Thus
\begin{align*}
A \leq B+a\sqrt{A}+a\sqrt{B}.
\end{align*}
Using [Young's inequality](/page/Young%27s%20Inequality) in the elementary form $uv \leq \frac{1}{2}u^2+\frac{1}{2}v^2$ with $(u,v)=(\sqrt{A},a)$ gives
\begin{align*}
a\sqrt{A} \leq \frac{1}{2}A+\frac{1}{2}a^2.
\end{align*}
Using the same inequality with $(u,v)=(\sqrt{B},a)$ gives
\begin{align*}
a\sqrt{B} \leq \frac{1}{2}B+\frac{1}{2}a^2.
\end{align*}
Consequently,
\begin{align*}
A
\leq
B+\frac{1}{2}A+\frac{1}{2}a^2+\frac{1}{2}B+\frac{1}{2}a^2
=
\frac{1}{2}A+\frac{3}{2}B+a^2.
\end{align*}
Subtracting $\frac{1}{2}A$ from both sides yields
\begin{align*}
A \leq 3B+2a^2.
\end{align*}
Since $a^2=9\lambda^2s/\phi^2$, we obtain
\begin{align*}
\frac{1}{n}|X(\hat{\beta}-\beta^*)|^2
\leq
3\frac{1}{n}|X(\beta-\beta^*)|^2
+
18\frac{\lambda^2 s}{\phi^2}.
\end{align*}
Because $s=\|\beta\|_0$, this implies
\begin{align*}
\frac{1}{n}|X(\hat{\beta}-\beta^*)|^2
\leq
18\left\{
\frac{1}{n}|X(\beta-\beta^*)|^2
+
\frac{\lambda^2\|\beta\|_0}{\phi^2}
\right\}.
\end{align*}
[guided]
The compatibility step left us with the estimate
\begin{align*}
\frac{1}{n}|X(\hat{\beta}-\beta^*)|^2
\leq
\frac{1}{n}|X(\beta-\beta^*)|^2
+
\frac{3\lambda\sqrt{s}}{\phi}\frac{|Xd|}{\sqrt{n}}.
\end{align*}
We now convert the remaining $|Xd|$ term into the two prediction errors already present in the theorem. Define $A \in [0,\infty)$ by
\begin{align*}
A := \frac{1}{n}|X(\hat{\beta}-\beta^*)|^2.
\end{align*}
Define $B \in [0,\infty)$ by
\begin{align*}
B := \frac{1}{n}|X(\beta-\beta^*)|^2.
\end{align*}
Define $a \in [0,\infty)$ by
\begin{align*}
a := \frac{3\lambda\sqrt{s}}{\phi}.
\end{align*}
Since
\begin{align*}
Xd=X(\hat{\beta}-\beta)=X(\hat{\beta}-\beta^*)-X(\beta-\beta^*),
\end{align*}
the [Euclidean triangle inequality](/page/Triangle%20Inequality) gives
\begin{align*}
\frac{|Xd|}{\sqrt{n}} \leq \sqrt{A}+\sqrt{B}.
\end{align*}
Therefore
\begin{align*}
A \leq B+a\sqrt{A}+a\sqrt{B}.
\end{align*}
We apply [Young's inequality](/page/Young%27s%20Inequality) in the form $uv\leq \frac{1}{2}u^2+\frac{1}{2}v^2$. With $(u,v)=(\sqrt{A},a)$,
\begin{align*}
a\sqrt{A} \leq \frac{1}{2}A+\frac{1}{2}a^2.
\end{align*}
With $(u,v)=(\sqrt{B},a)$,
\begin{align*}
a\sqrt{B} \leq \frac{1}{2}B+\frac{1}{2}a^2.
\end{align*}
Substituting both bounds gives
\begin{align*}
A
\leq
B+\frac{1}{2}A+\frac{1}{2}a^2+\frac{1}{2}B+\frac{1}{2}a^2
=
\frac{1}{2}A+\frac{3}{2}B+a^2.
\end{align*}
Subtracting $\frac{1}{2}A$ from both sides yields
\begin{align*}
A\leq 3B+2a^2.
\end{align*}
Because $a^2=9\lambda^2s/\phi^2$ and $s=\|\beta\|_0$, this becomes
\begin{align*}
\frac{1}{n}|X(\hat{\beta}-\beta^*)|^2
\leq
3\frac{1}{n}|X(\beta-\beta^*)|^2
+
18\frac{\lambda^2\|\beta\|_0}{\phi^2}.
\end{align*}
Since $3\leq 18$, we may use the single universal constant $18$ on both terms:
\begin{align*}
\frac{1}{n}|X(\hat{\beta}-\beta^*)|^2
\leq
18\left\{
\frac{1}{n}|X(\beta-\beta^*)|^2
+
\frac{\lambda^2\|\beta\|_0}{\phi^2}
\right\}.
\end{align*}
[/guided]
[/step]
[step:Take the infimum over all sparse comparators]
The preceding estimate holds on $\mathcal{E}_\lambda$ for every $\beta \in \mathbb{R}^p$ satisfying $\|\beta\|_0 \leq m$. Therefore taking the infimum over all such $\beta$ gives
\begin{align*}
\frac{1}{n}|X(\hat{\beta}-\beta^*)|^2
\leq
18
\inf_{\beta \in \mathbb{R}^p,\ \|\beta\|_0 \leq m}
\left\{
\frac{1}{n}|X(\beta-\beta^*)|^2
+
\frac{\lambda^2\|\beta\|_0}{\phi^2}
\right\}.
\end{align*}
This is the claimed oracle inequality.
[guided]
The estimate just proved was derived after fixing an arbitrary comparator $\beta\in\mathbb{R}^p$ with $\|\beta\|_0\leq m$. No later constant depends on this particular comparator, so the same bound holds simultaneously for every such $\beta$ on the event $\mathcal{E}_\lambda$:
\begin{align*}
\frac{1}{n}|X(\hat{\beta}-\beta^*)|^2
\leq
18\left\{
\frac{1}{n}|X(\beta-\beta^*)|^2
+
\frac{\lambda^2\|\beta\|_0}{\phi^2}
\right\}.
\end{align*}
Taking the infimum over the set of all sparse comparators $\{\beta\in\mathbb{R}^p:\|\beta\|_0\leq m\}$ gives
\begin{align*}
\frac{1}{n}|X(\hat{\beta}-\beta^*)|^2
\leq
18
\inf_{\beta \in \mathbb{R}^p,\ \|\beta\|_0 \leq m}
\left\{
\frac{1}{n}|X(\beta-\beta^*)|^2
+
\frac{\lambda^2\|\beta\|_0}{\phi^2}
\right\}.
\end{align*}
This is the oracle inequality with the universal constant $C=18$.
[/guided]
[/step]