[proofplan]
We prove the two estimates separately for the Lasso and the Dantzig selector, but the structure is the same in both cases. First, a deterministic optimality or feasibility inequality places the estimation error $\Delta = \hat\beta - \beta^*$ in the relevant cone. Then the restricted eigenvalue lower bound converts $\|\Delta_S\|_2$ into prediction norm, while sparsity converts $\|\Delta_S\|_1$ into $\sqrt{s}\|\Delta_S\|_2$. Combining these inequalities gives the stated prediction and $\ell^1$ rates with universal constants.
[/proofplan]
[step:Derive the Lasso basic inequality from optimality]
Define the Lasso error vector $\Delta_{\mathrm{L}} \in \mathbb{R}^p$ by
\begin{align*}
\Delta_{\mathrm{L}} := \hat\beta_{\mathrm{L}} - \beta^*.
\end{align*}
Since $y = X\beta^* + \varepsilon$, the Lasso objective at $\hat\beta_{\mathrm{L}} = \beta^* + \Delta_{\mathrm{L}}$ is no larger than the objective at $\beta^*$. Hence
\begin{align*}
\frac{1}{2n}\|\varepsilon - X\Delta_{\mathrm{L}}\|_2^2 + \lambda\|\beta^* + \Delta_{\mathrm{L}}\|_1
\leq
\frac{1}{2n}\|\varepsilon\|_2^2 + \lambda\|\beta^*\|_1.
\end{align*}
Expanding the squared norm in the Euclidean [inner product](/page/Inner%20Product) on $\mathbb{R}^n$ gives
\begin{align*}
\|\varepsilon - X\Delta_{\mathrm{L}}\|_2^2
=
\|\varepsilon\|_2^2
- 2\varepsilon^\top X\Delta_{\mathrm{L}}
+ \|X\Delta_{\mathrm{L}}\|_2^2.
\end{align*}
Substituting this identity and cancelling $\|\varepsilon\|_2^2/(2n)$ yields the [Lasso basic inequality](/theorems/5555)
\begin{align*}
\frac{\|X\Delta_{\mathrm{L}}\|_2^2}{2n}
\leq
\frac{\varepsilon^\top X\Delta_{\mathrm{L}}}{n}
+
\lambda\left(\|\beta^*\|_1 - \|\beta^* + \Delta_{\mathrm{L}}\|_1\right).
\end{align*}
[guided]
The only input in this step is the defining minimisation property of $\hat\beta_{\mathrm{L}}$. Because $\hat\beta_{\mathrm{L}}$ minimises the map
\begin{align*}
\beta \mapsto \frac{1}{2n}\|y - X\beta\|_2^2 + \lambda\|\beta\|_1
\end{align*}
over $\mathbb{R}^p$, we may compare the value at $\hat\beta_{\mathrm{L}}$ with the value at the true vector $\beta^*$. Writing
\begin{align*}
\Delta_{\mathrm{L}} := \hat\beta_{\mathrm{L}} - \beta^*
\end{align*}
and using $y = X\beta^* + \varepsilon$, we have
\begin{align*}
y - X\hat\beta_{\mathrm{L}}
=
X\beta^* + \varepsilon - X(\beta^* + \Delta_{\mathrm{L}})
=
\varepsilon - X\Delta_{\mathrm{L}}.
\end{align*}
Therefore optimality gives
\begin{align*}
\frac{1}{2n}\|\varepsilon - X\Delta_{\mathrm{L}}\|_2^2
+
\lambda\|\beta^* + \Delta_{\mathrm{L}}\|_1
\leq
\frac{1}{2n}\|\varepsilon\|_2^2
+
\lambda\|\beta^*\|_1.
\end{align*}
The purpose of expanding the square is to isolate the prediction error $\|X\Delta_{\mathrm{L}}\|_2^2/n$, which is one of the quantities we want to bound. The Euclidean identity
\begin{align*}
\|a-b\|_2^2 = \|a\|_2^2 - 2a^\top b + \|b\|_2^2
\end{align*}
applied with $a=\varepsilon$ and $b=X\Delta_{\mathrm{L}}$ gives
\begin{align*}
\|\varepsilon - X\Delta_{\mathrm{L}}\|_2^2
=
\|\varepsilon\|_2^2
-
2\varepsilon^\top X\Delta_{\mathrm{L}}
+
\|X\Delta_{\mathrm{L}}\|_2^2.
\end{align*}
After substitution and cancellation of the common term $\|\varepsilon\|_2^2/(2n)$, we obtain
\begin{align*}
\frac{\|X\Delta_{\mathrm{L}}\|_2^2}{2n}
\leq
\frac{\varepsilon^\top X\Delta_{\mathrm{L}}}{n}
+
\lambda\left(\|\beta^*\|_1 - \|\beta^* + \Delta_{\mathrm{L}}\|_1\right).
\end{align*}
This is the basic inequality: it converts optimisation into an inequality involving the statistical score and the $\ell^1$ geometry of the support $S$.
[/guided]
[/step]
[step:Place the Lasso error in the cone with constant $3$]
By Hölder's inequality for the dual pair $(\ell^\infty,\ell^1)$ and the score event,
\begin{align*}
\left|\frac{\varepsilon^\top X\Delta_{\mathrm{L}}}{n}\right|
=
\left|\left(\frac{X^\top\varepsilon}{n}\right)^\top \Delta_{\mathrm{L}}\right|
\leq
\left\|\frac{X^\top\varepsilon}{n}\right\|_\infty\|\Delta_{\mathrm{L}}\|_1
\leq
\frac{\lambda}{2}\|\Delta_{\mathrm{L}}\|_1.
\end{align*}
Since $\operatorname{supp}(\beta^*)=S$, we have $\beta^*_{S^c}=0$. Using the triangle inequality on $S$ and $S^c$ separately,
First,
\begin{align*}
\|\beta^*\|_1 - \|\beta^*+\Delta_{\mathrm{L}}\|_1
=
\|\beta^*_S\|_1
-
\|\beta^*_S+\Delta_{\mathrm{L},S}\|_1
-
\|\Delta_{\mathrm{L},S^c}\|_1.
\end{align*}
The triangle inequality on the coordinates in $S$ gives
\begin{align*}
\|\beta^*_S\|_1
-
\|\beta^*_S+\Delta_{\mathrm{L},S}\|_1
\leq
\|\Delta_{\mathrm{L},S}\|_1,
\end{align*}
and therefore
\begin{align*}
\|\beta^*\|_1 - \|\beta^*+\Delta_{\mathrm{L}}\|_1
\leq
\|\Delta_{\mathrm{L},S}\|_1
-
\|\Delta_{\mathrm{L},S^c}\|_1.
\end{align*}
Substituting these two estimates into the Lasso basic inequality gives
\begin{align*}
\frac{\|X\Delta_{\mathrm{L}}\|_2^2}{2n}
\leq
\frac{\lambda}{2}\|\Delta_{\mathrm{L}}\|_1
+
\lambda\|\Delta_{\mathrm{L},S}\|_1
-
\lambda\|\Delta_{\mathrm{L},S^c}\|_1.
\end{align*}
Because $\|\Delta_{\mathrm{L}}\|_1=\|\Delta_{\mathrm{L},S}\|_1+\|\Delta_{\mathrm{L},S^c}\|_1$, this becomes
\begin{align*}
\frac{\|X\Delta_{\mathrm{L}}\|_2^2}{2n}
\leq
\frac{3\lambda}{2}\|\Delta_{\mathrm{L},S}\|_1
-
\frac{\lambda}{2}\|\Delta_{\mathrm{L},S^c}\|_1.
\end{align*}
The left-hand side is non-negative, so
\begin{align*}
\|\Delta_{\mathrm{L},S^c}\|_1 \leq 3\|\Delta_{\mathrm{L},S}\|_1.
\end{align*}
[/step]
[step:Convert the Lasso cone inequality into prediction and $\ell^1$ bounds]
Since $\Delta_{\mathrm{L}}$ lies in the cone $\|\Delta_{\mathrm{L},S^c}\|_1 \leq 3\|\Delta_{\mathrm{L},S}\|_1$, the restricted eigenvalue hypothesis gives
\begin{align*}
\|\Delta_{\mathrm{L},S}\|_2
\leq
\frac{1}{\kappa}\frac{\|X\Delta_{\mathrm{L}}\|_2}{\sqrt{n}}.
\end{align*}
By Cauchy-Schwarz on the finite set $S$,
\begin{align*}
\|\Delta_{\mathrm{L},S}\|_1
\leq
\sqrt{s}\|\Delta_{\mathrm{L},S}\|_2
\leq
\frac{\sqrt{s}}{\kappa}\frac{\|X\Delta_{\mathrm{L}}\|_2}{\sqrt{n}}.
\end{align*}
From the previous step,
\begin{align*}
\frac{\|X\Delta_{\mathrm{L}}\|_2^2}{2n}
\leq
\frac{3\lambda}{2}\|\Delta_{\mathrm{L},S}\|_1.
\end{align*}
Combining the last two inequalities and writing
\begin{align*}
A_{\mathrm{L}} := \frac{\|X\Delta_{\mathrm{L}}\|_2}{\sqrt{n}},
\end{align*}
we obtain
\begin{align*}
\frac{A_{\mathrm{L}}^2}{2}
\leq
\frac{3\lambda\sqrt{s}}{2\kappa}A_{\mathrm{L}}.
\end{align*}
If $A_{\mathrm{L}}=0$, the prediction bound is immediate. If $A_{\mathrm{L}}>0$, division by $A_{\mathrm{L}}$ gives
\begin{align*}
A_{\mathrm{L}}
\leq
\frac{3\lambda\sqrt{s}}{\kappa},
\end{align*}
and hence
\begin{align*}
\frac{\|X\Delta_{\mathrm{L}}\|_2^2}{n}
=
A_{\mathrm{L}}^2
\leq
\frac{9\lambda^2s}{\kappa^2}.
\end{align*}
Finally, the cone inequality gives
\begin{align*}
\|\Delta_{\mathrm{L}}\|_1
=
\|\Delta_{\mathrm{L},S}\|_1+\|\Delta_{\mathrm{L},S^c}\|_1
\leq
4\|\Delta_{\mathrm{L},S}\|_1
\leq
\frac{4\sqrt{s}}{\kappa}A_{\mathrm{L}}
\leq
\frac{12\lambda s}{\kappa^2}.
\end{align*}
[/step]
[step:Place the Dantzig selector error in the cone with constant $1$]
Define the Dantzig selector error vector $\Delta_{\mathrm{DS}} \in \mathbb{R}^p$ by
\begin{align*}
\Delta_{\mathrm{DS}} := \hat\beta_{\mathrm{DS}} - \beta^*.
\end{align*}
The score event implies that $\beta^*$ is feasible for the Dantzig constraint because
\begin{align*}
\left\|\frac{X^\top(y-X\beta^*)}{n}\right\|_\infty
=
\left\|\frac{X^\top\varepsilon}{n}\right\|_\infty
\leq
\frac{\lambda}{2}
\leq
\lambda.
\end{align*}
Since $\hat\beta_{\mathrm{DS}}$ minimises the $\ell^1$ norm over the feasible set,
\begin{align*}
\|\beta^*+\Delta_{\mathrm{DS}}\|_1
=
\|\hat\beta_{\mathrm{DS}}\|_1
\leq
\|\beta^*\|_1.
\end{align*}
Using $\operatorname{supp}(\beta^*)=S$ and the same support decomposition as above,
First,
\begin{align*}
\|\beta^*+\Delta_{\mathrm{DS}}\|_1
=
\|\beta^*_S+\Delta_{\mathrm{DS},S}\|_1
+
\|\Delta_{\mathrm{DS},S^c}\|_1.
\end{align*}
The [reverse triangle inequality](/theorems/2300) on the coordinates in $S$ gives
\begin{align*}
\|\beta^*_S+\Delta_{\mathrm{DS},S}\|_1
\geq
\|\beta^*_S\|_1
-
\|\Delta_{\mathrm{DS},S}\|_1,
\end{align*}
and hence
\begin{align*}
\|\beta^*+\Delta_{\mathrm{DS}}\|_1
\geq
\|\beta^*_S\|_1
-
\|\Delta_{\mathrm{DS},S}\|_1
+
\|\Delta_{\mathrm{DS},S^c}\|_1.
\end{align*}
Combining this lower bound with $\|\beta^*+\Delta_{\mathrm{DS}}\|_1 \leq \|\beta^*_S\|_1$ gives
\begin{align*}
\|\Delta_{\mathrm{DS},S^c}\|_1
\leq
\|\Delta_{\mathrm{DS},S}\|_1.
\end{align*}
[guided]
The Dantzig selector is controlled by feasibility rather than by a quadratic optimality inequality. First we verify that the true vector $\beta^*$ is allowed in the Dantzig optimisation problem. Since $y=X\beta^*+\varepsilon$, the residual at $\beta^*$ is
\begin{align*}
y-X\beta^*=\varepsilon.
\end{align*}
Therefore
\begin{align*}
\left\|\frac{X^\top(y-X\beta^*)}{n}\right\|_\infty
=
\left\|\frac{X^\top\varepsilon}{n}\right\|_\infty
\leq
\frac{\lambda}{2}
\leq
\lambda.
\end{align*}
Thus $\beta^*$ belongs to the feasible set in the definition of $\hat\beta_{\mathrm{DS}}$.
Because $\hat\beta_{\mathrm{DS}}$ minimises $\|\beta\|_1$ over that feasible set, its $\ell^1$ norm is no larger than the $\ell^1$ norm of the feasible point $\beta^*$. With
\begin{align*}
\Delta_{\mathrm{DS}} := \hat\beta_{\mathrm{DS}}-\beta^*,
\end{align*}
this says
\begin{align*}
\|\beta^*+\Delta_{\mathrm{DS}}\|_1
\leq
\|\beta^*\|_1.
\end{align*}
Now we use sparsity. Since $S=\operatorname{supp}(\beta^*)$, the vector $\beta^*$ vanishes on $S^c$. Hence
\begin{align*}
\|\beta^*+\Delta_{\mathrm{DS}}\|_1
=
\|\beta^*_S+\Delta_{\mathrm{DS},S}\|_1
+
\|\Delta_{\mathrm{DS},S^c}\|_1.
\end{align*}
The reverse triangle inequality on the coordinates in $S$ gives
\begin{align*}
\|\beta^*_S+\Delta_{\mathrm{DS},S}\|_1
\geq
\|\beta^*_S\|_1-\|\Delta_{\mathrm{DS},S}\|_1.
\end{align*}
Therefore
\begin{align*}
\|\beta^*+\Delta_{\mathrm{DS}}\|_1
\geq
\|\beta^*_S\|_1-\|\Delta_{\mathrm{DS},S}\|_1+\|\Delta_{\mathrm{DS},S^c}\|_1.
\end{align*}
Combining this with
\begin{align*}
\|\beta^*+\Delta_{\mathrm{DS}}\|_1 \leq \|\beta^*\|_1=\|\beta^*_S\|_1
\end{align*}
and cancelling $\|\beta^*_S\|_1$ yields
\begin{align*}
\|\Delta_{\mathrm{DS},S^c}\|_1
\leq
\|\Delta_{\mathrm{DS},S}\|_1.
\end{align*}
This is the sharper Dantzig cone condition: its constant is $1$ because the Dantzig argument uses direct $\ell^1$ minimality rather than the Lasso basic inequality with an additional score term.
[/guided]
[/step]
[step:Convert Dantzig feasibility into prediction and $\ell^1$ bounds]
Since both $\hat\beta_{\mathrm{DS}}$ and $\beta^*$ are feasible up to radii $\lambda$ and $\lambda/2$, respectively,
The identity
\begin{align*}
X\hat\beta_{\mathrm{DS}}-X\beta^*
=
(y-X\beta^*)-(y-X\hat\beta_{\mathrm{DS}})
\end{align*}
gives
\begin{align*}
\left\|\frac{X^\top X\Delta_{\mathrm{DS}}}{n}\right\|_\infty
=
\left\|\frac{X^\top(y-X\beta^*)}{n}
-
\frac{X^\top(y-X\hat\beta_{\mathrm{DS}})}{n}
\right\|_\infty.
\end{align*}
Applying the triangle inequality in $\ell^\infty$ and using the score event together with Dantzig feasibility of $\hat\beta_{\mathrm{DS}}$ yields
\begin{align*}
\left\|\frac{X^\top X\Delta_{\mathrm{DS}}}{n}\right\|_\infty
\leq
\left\|\frac{X^\top\varepsilon}{n}\right\|_\infty
+
\left\|\frac{X^\top(y-X\hat\beta_{\mathrm{DS}})}{n}\right\|_\infty
\leq
\frac{3\lambda}{2}.
\end{align*}
Therefore, using Hölder's inequality for $(\ell^1,\ell^\infty)$,
\begin{align*}
\frac{\|X\Delta_{\mathrm{DS}}\|_2^2}{n}
=
\Delta_{\mathrm{DS}}^\top\frac{X^\top X\Delta_{\mathrm{DS}}}{n}
\leq
\|\Delta_{\mathrm{DS}}\|_1
\left\|\frac{X^\top X\Delta_{\mathrm{DS}}}{n}\right\|_\infty
\leq
\frac{3\lambda}{2}\|\Delta_{\mathrm{DS}}\|_1.
\end{align*}
The cone condition gives
\begin{align*}
\|\Delta_{\mathrm{DS}}\|_1
\leq
2\|\Delta_{\mathrm{DS},S}\|_1
\leq
2\sqrt{s}\|\Delta_{\mathrm{DS},S}\|_2.
\end{align*}
Since $\Delta_{\mathrm{DS}}$ lies in the cone $\|\Delta_{\mathrm{DS},S^c}\|_1 \leq \|\Delta_{\mathrm{DS},S}\|_1$, the restricted eigenvalue hypothesis gives
\begin{align*}
\|\Delta_{\mathrm{DS},S}\|_2
\leq
\frac{1}{\kappa}\frac{\|X\Delta_{\mathrm{DS}}\|_2}{\sqrt{n}}.
\end{align*}
Writing
\begin{align*}
A_{\mathrm{DS}} := \frac{\|X\Delta_{\mathrm{DS}}\|_2}{\sqrt{n}},
\end{align*}
we obtain
\begin{align*}
A_{\mathrm{DS}}^2
\leq
\frac{3\lambda}{2}\cdot
2\sqrt{s}\cdot
\frac{A_{\mathrm{DS}}}{\kappa}
=
\frac{3\lambda\sqrt{s}}{\kappa}A_{\mathrm{DS}}.
\end{align*}
If $A_{\mathrm{DS}}=0$, the prediction bound is immediate. If $A_{\mathrm{DS}}>0$, division by $A_{\mathrm{DS}}$ gives
\begin{align*}
A_{\mathrm{DS}}
\leq
\frac{3\lambda\sqrt{s}}{\kappa},
\end{align*}
so
\begin{align*}
\frac{\|X\Delta_{\mathrm{DS}}\|_2^2}{n}
=
A_{\mathrm{DS}}^2
\leq
\frac{9\lambda^2s}{\kappa^2}.
\end{align*}
The $\ell^1$ estimate follows from the preceding cone and restricted eigenvalue bounds:
\begin{align*}
\|\Delta_{\mathrm{DS}}\|_1
\leq
\frac{2\sqrt{s}}{\kappa}A_{\mathrm{DS}}
\leq
\frac{6\lambda s}{\kappa^2}
\leq
\frac{12\lambda s}{\kappa^2}.
\end{align*}
[/step]
[step:Combine the two estimator-specific estimates]
For the Lasso error $\Delta_{\mathrm{L}}=\hat\beta_{\mathrm{L}}-\beta^*$, we proved
\begin{align*}
\frac{\|X\Delta_{\mathrm{L}}\|_2^2}{n}
\leq
\frac{9\lambda^2s}{\kappa^2},
\qquad
\|\Delta_{\mathrm{L}}\|_1
\leq
\frac{12\lambda s}{\kappa^2}.
\end{align*}
For the Dantzig selector error $\Delta_{\mathrm{DS}}=\hat\beta_{\mathrm{DS}}-\beta^*$, we proved
\begin{align*}
\frac{\|X\Delta_{\mathrm{DS}}\|_2^2}{n}
\leq
\frac{9\lambda^2s}{\kappa^2},
\qquad
\|\Delta_{\mathrm{DS}}\|_1
\leq
\frac{12\lambda s}{\kappa^2}.
\end{align*}
Thus the same prediction and $\ell^1$ rates hold for each $\hat\beta \in \{\hat\beta_{\mathrm{L}},\hat\beta_{\mathrm{DS}}\}$, which is the desired comparison.
[/step]