Comparison of Lasso and Dantzig Selector Rates — Statement & Proof

Comparison of Lasso and Dantzig Selector Rates (Theorem # 5577)

Theorem

Edit Issues Pull Requests Attributions Admin

Discussion

Proof

[proofplan] We prove the two estimates separately for the Lasso and the Dantzig selector, but the structure is the same in both cases. First, a deterministic optimality or feasibility inequality places the estimation error $\Delta = \hat\beta - \beta^*$ in the relevant cone. Then the restricted eigenvalue lower bound converts $\|\Delta_S\|_2$ into prediction norm, while sparsity converts $\|\Delta_S\|_1$ into $\sqrt{s}\|\Delta_S\|_2$. Combining these inequalities gives the stated prediction and $\ell^1$ rates with universal constants. [/proofplan] [step:Derive the Lasso basic inequality from optimality] Define the Lasso error vector $\Delta_{\mathrm{L}} \in \mathbb{R}^p$ by \begin{align*} \Delta_{\mathrm{L}} := \hat\beta_{\mathrm{L}} - \beta^*. \end{align*} Since $y = X\beta^* + \varepsilon$, the Lasso objective at $\hat\beta_{\mathrm{L}} = \beta^* + \Delta_{\mathrm{L}}$ is no larger than the objective at $\beta^*$. Hence \begin{align*} \frac{1}{2n}\|\varepsilon - X\Delta_{\mathrm{L}}\|_2^2 + \lambda\|\beta^* + \Delta_{\mathrm{L}}\|_1 \leq \frac{1}{2n}\|\varepsilon\|_2^2 + \lambda\|\beta^*\|_1. \end{align*} Expanding the squared norm in the Euclidean [inner product](/page/Inner%20Product) on $\mathbb{R}^n$ gives \begin{align*} \|\varepsilon - X\Delta_{\mathrm{L}}\|_2^2 = \|\varepsilon\|_2^2 - 2\varepsilon^\top X\Delta_{\mathrm{L}} + \|X\Delta_{\mathrm{L}}\|_2^2. \end{align*} Substituting this identity and cancelling $\|\varepsilon\|_2^2/(2n)$ yields the [Lasso basic inequality](/theorems/5555) \begin{align*} \frac{\|X\Delta_{\mathrm{L}}\|_2^2}{2n} \leq \frac{\varepsilon^\top X\Delta_{\mathrm{L}}}{n} + \lambda\left(\|\beta^*\|_1 - \|\beta^* + \Delta_{\mathrm{L}}\|_1\right). \end{align*} [guided] The only input in this step is the defining minimisation property of $\hat\beta_{\mathrm{L}}$. Because $\hat\beta_{\mathrm{L}}$ minimises the map \begin{align*} \beta \mapsto \frac{1}{2n}\|y - X\beta\|_2^2 + \lambda\|\beta\|_1 \end{align*} over $\mathbb{R}^p$, we may compare the value at $\hat\beta_{\mathrm{L}}$ with the value at the true vector $\beta^*$. Writing \begin{align*} \Delta_{\mathrm{L}} := \hat\beta_{\mathrm{L}} - \beta^* \end{align*} and using $y = X\beta^* + \varepsilon$, we have \begin{align*} y - X\hat\beta_{\mathrm{L}} = X\beta^* + \varepsilon - X(\beta^* + \Delta_{\mathrm{L}}) = \varepsilon - X\Delta_{\mathrm{L}}. \end{align*} Therefore optimality gives \begin{align*} \frac{1}{2n}\|\varepsilon - X\Delta_{\mathrm{L}}\|_2^2 + \lambda\|\beta^* + \Delta_{\mathrm{L}}\|_1 \leq \frac{1}{2n}\|\varepsilon\|_2^2 + \lambda\|\beta^*\|_1. \end{align*} The purpose of expanding the square is to isolate the prediction error $\|X\Delta_{\mathrm{L}}\|_2^2/n$, which is one of the quantities we want to bound. The Euclidean identity \begin{align*} \|a-b\|_2^2 = \|a\|_2^2 - 2a^\top b + \|b\|_2^2 \end{align*} applied with $a=\varepsilon$ and $b=X\Delta_{\mathrm{L}}$ gives \begin{align*} \|\varepsilon - X\Delta_{\mathrm{L}}\|_2^2 = \|\varepsilon\|_2^2 - 2\varepsilon^\top X\Delta_{\mathrm{L}} + \|X\Delta_{\mathrm{L}}\|_2^2. \end{align*} After substitution and cancellation of the common term $\|\varepsilon\|_2^2/(2n)$, we obtain \begin{align*} \frac{\|X\Delta_{\mathrm{L}}\|_2^2}{2n} \leq \frac{\varepsilon^\top X\Delta_{\mathrm{L}}}{n} + \lambda\left(\|\beta^*\|_1 - \|\beta^* + \Delta_{\mathrm{L}}\|_1\right). \end{align*} This is the basic inequality: it converts optimisation into an inequality involving the statistical score and the $\ell^1$ geometry of the support $S$. [/guided] [/step] [step:Place the Lasso error in the cone with constant $3$] By Hölder's inequality for the dual pair $(\ell^\infty,\ell^1)$ and the score event, \begin{align*} \left|\frac{\varepsilon^\top X\Delta_{\mathrm{L}}}{n}\right| = \left|\left(\frac{X^\top\varepsilon}{n}\right)^\top \Delta_{\mathrm{L}}\right| \leq \left\|\frac{X^\top\varepsilon}{n}\right\|_\infty\|\Delta_{\mathrm{L}}\|_1 \leq \frac{\lambda}{2}\|\Delta_{\mathrm{L}}\|_1. \end{align*} Since $\operatorname{supp}(\beta^*)=S$, we have $\beta^*_{S^c}=0$. Using the triangle inequality on $S$ and $S^c$ separately, First, \begin{align*} \|\beta^*\|_1 - \|\beta^*+\Delta_{\mathrm{L}}\|_1 = \|\beta^*_S\|_1 - \|\beta^*_S+\Delta_{\mathrm{L},S}\|_1 - \|\Delta_{\mathrm{L},S^c}\|_1. \end{align*} The triangle inequality on the coordinates in $S$ gives \begin{align*} \|\beta^*_S\|_1 - \|\beta^*_S+\Delta_{\mathrm{L},S}\|_1 \leq \|\Delta_{\mathrm{L},S}\|_1, \end{align*} and therefore \begin{align*} \|\beta^*\|_1 - \|\beta^*+\Delta_{\mathrm{L}}\|_1 \leq \|\Delta_{\mathrm{L},S}\|_1 - \|\Delta_{\mathrm{L},S^c}\|_1. \end{align*} Substituting these two estimates into the Lasso basic inequality gives \begin{align*} \frac{\|X\Delta_{\mathrm{L}}\|_2^2}{2n} \leq \frac{\lambda}{2}\|\Delta_{\mathrm{L}}\|_1 + \lambda\|\Delta_{\mathrm{L},S}\|_1 - \lambda\|\Delta_{\mathrm{L},S^c}\|_1. \end{align*} Because $\|\Delta_{\mathrm{L}}\|_1=\|\Delta_{\mathrm{L},S}\|_1+\|\Delta_{\mathrm{L},S^c}\|_1$, this becomes \begin{align*} \frac{\|X\Delta_{\mathrm{L}}\|_2^2}{2n} \leq \frac{3\lambda}{2}\|\Delta_{\mathrm{L},S}\|_1 - \frac{\lambda}{2}\|\Delta_{\mathrm{L},S^c}\|_1. \end{align*} The left-hand side is non-negative, so \begin{align*} \|\Delta_{\mathrm{L},S^c}\|_1 \leq 3\|\Delta_{\mathrm{L},S}\|_1. \end{align*} [/step] [step:Convert the Lasso cone inequality into prediction and $\ell^1$ bounds] Since $\Delta_{\mathrm{L}}$ lies in the cone $\|\Delta_{\mathrm{L},S^c}\|_1 \leq 3\|\Delta_{\mathrm{L},S}\|_1$, the restricted eigenvalue hypothesis gives \begin{align*} \|\Delta_{\mathrm{L},S}\|_2 \leq \frac{1}{\kappa}\frac{\|X\Delta_{\mathrm{L}}\|_2}{\sqrt{n}}. \end{align*} By Cauchy-Schwarz on the finite set $S$, \begin{align*} \|\Delta_{\mathrm{L},S}\|_1 \leq \sqrt{s}\|\Delta_{\mathrm{L},S}\|_2 \leq \frac{\sqrt{s}}{\kappa}\frac{\|X\Delta_{\mathrm{L}}\|_2}{\sqrt{n}}. \end{align*} From the previous step, \begin{align*} \frac{\|X\Delta_{\mathrm{L}}\|_2^2}{2n} \leq \frac{3\lambda}{2}\|\Delta_{\mathrm{L},S}\|_1. \end{align*} Combining the last two inequalities and writing \begin{align*} A_{\mathrm{L}} := \frac{\|X\Delta_{\mathrm{L}}\|_2}{\sqrt{n}}, \end{align*} we obtain \begin{align*} \frac{A_{\mathrm{L}}^2}{2} \leq \frac{3\lambda\sqrt{s}}{2\kappa}A_{\mathrm{L}}. \end{align*} If $A_{\mathrm{L}}=0$, the prediction bound is immediate. If $A_{\mathrm{L}}>0$, division by $A_{\mathrm{L}}$ gives \begin{align*} A_{\mathrm{L}} \leq \frac{3\lambda\sqrt{s}}{\kappa}, \end{align*} and hence \begin{align*} \frac{\|X\Delta_{\mathrm{L}}\|_2^2}{n} = A_{\mathrm{L}}^2 \leq \frac{9\lambda^2s}{\kappa^2}. \end{align*} Finally, the cone inequality gives \begin{align*} \|\Delta_{\mathrm{L}}\|_1 = \|\Delta_{\mathrm{L},S}\|_1+\|\Delta_{\mathrm{L},S^c}\|_1 \leq 4\|\Delta_{\mathrm{L},S}\|_1 \leq \frac{4\sqrt{s}}{\kappa}A_{\mathrm{L}} \leq \frac{12\lambda s}{\kappa^2}. \end{align*} [/step] [step:Place the Dantzig selector error in the cone with constant $1$] Define the Dantzig selector error vector $\Delta_{\mathrm{DS}} \in \mathbb{R}^p$ by \begin{align*} \Delta_{\mathrm{DS}} := \hat\beta_{\mathrm{DS}} - \beta^*. \end{align*} The score event implies that $\beta^*$ is feasible for the Dantzig constraint because \begin{align*} \left\|\frac{X^\top(y-X\beta^*)}{n}\right\|_\infty = \left\|\frac{X^\top\varepsilon}{n}\right\|_\infty \leq \frac{\lambda}{2} \leq \lambda. \end{align*} Since $\hat\beta_{\mathrm{DS}}$ minimises the $\ell^1$ norm over the feasible set, \begin{align*} \|\beta^*+\Delta_{\mathrm{DS}}\|_1 = \|\hat\beta_{\mathrm{DS}}\|_1 \leq \|\beta^*\|_1. \end{align*} Using $\operatorname{supp}(\beta^*)=S$ and the same support decomposition as above, First, \begin{align*} \|\beta^*+\Delta_{\mathrm{DS}}\|_1 = \|\beta^*_S+\Delta_{\mathrm{DS},S}\|_1 + \|\Delta_{\mathrm{DS},S^c}\|_1. \end{align*} The [reverse triangle inequality](/theorems/2300) on the coordinates in $S$ gives \begin{align*} \|\beta^*_S+\Delta_{\mathrm{DS},S}\|_1 \geq \|\beta^*_S\|_1 - \|\Delta_{\mathrm{DS},S}\|_1, \end{align*} and hence \begin{align*} \|\beta^*+\Delta_{\mathrm{DS}}\|_1 \geq \|\beta^*_S\|_1 - \|\Delta_{\mathrm{DS},S}\|_1 + \|\Delta_{\mathrm{DS},S^c}\|_1. \end{align*} Combining this lower bound with $\|\beta^*+\Delta_{\mathrm{DS}}\|_1 \leq \|\beta^*_S\|_1$ gives \begin{align*} \|\Delta_{\mathrm{DS},S^c}\|_1 \leq \|\Delta_{\mathrm{DS},S}\|_1. \end{align*} [guided] The Dantzig selector is controlled by feasibility rather than by a quadratic optimality inequality. First we verify that the true vector $\beta^*$ is allowed in the Dantzig optimisation problem. Since $y=X\beta^*+\varepsilon$, the residual at $\beta^*$ is \begin{align*} y-X\beta^*=\varepsilon. \end{align*} Therefore \begin{align*} \left\|\frac{X^\top(y-X\beta^*)}{n}\right\|_\infty = \left\|\frac{X^\top\varepsilon}{n}\right\|_\infty \leq \frac{\lambda}{2} \leq \lambda. \end{align*} Thus $\beta^*$ belongs to the feasible set in the definition of $\hat\beta_{\mathrm{DS}}$. Because $\hat\beta_{\mathrm{DS}}$ minimises $\|\beta\|_1$ over that feasible set, its $\ell^1$ norm is no larger than the $\ell^1$ norm of the feasible point $\beta^*$. With \begin{align*} \Delta_{\mathrm{DS}} := \hat\beta_{\mathrm{DS}}-\beta^*, \end{align*} this says \begin{align*} \|\beta^*+\Delta_{\mathrm{DS}}\|_1 \leq \|\beta^*\|_1. \end{align*} Now we use sparsity. Since $S=\operatorname{supp}(\beta^*)$, the vector $\beta^*$ vanishes on $S^c$. Hence \begin{align*} \|\beta^*+\Delta_{\mathrm{DS}}\|_1 = \|\beta^*_S+\Delta_{\mathrm{DS},S}\|_1 + \|\Delta_{\mathrm{DS},S^c}\|_1. \end{align*} The reverse triangle inequality on the coordinates in $S$ gives \begin{align*} \|\beta^*_S+\Delta_{\mathrm{DS},S}\|_1 \geq \|\beta^*_S\|_1-\|\Delta_{\mathrm{DS},S}\|_1. \end{align*} Therefore \begin{align*} \|\beta^*+\Delta_{\mathrm{DS}}\|_1 \geq \|\beta^*_S\|_1-\|\Delta_{\mathrm{DS},S}\|_1+\|\Delta_{\mathrm{DS},S^c}\|_1. \end{align*} Combining this with \begin{align*} \|\beta^*+\Delta_{\mathrm{DS}}\|_1 \leq \|\beta^*\|_1=\|\beta^*_S\|_1 \end{align*} and cancelling $\|\beta^*_S\|_1$ yields \begin{align*} \|\Delta_{\mathrm{DS},S^c}\|_1 \leq \|\Delta_{\mathrm{DS},S}\|_1. \end{align*} This is the sharper Dantzig cone condition: its constant is $1$ because the Dantzig argument uses direct $\ell^1$ minimality rather than the Lasso basic inequality with an additional score term. [/guided] [/step] [step:Convert Dantzig feasibility into prediction and $\ell^1$ bounds] Since both $\hat\beta_{\mathrm{DS}}$ and $\beta^*$ are feasible up to radii $\lambda$ and $\lambda/2$, respectively, The identity \begin{align*} X\hat\beta_{\mathrm{DS}}-X\beta^* = (y-X\beta^*)-(y-X\hat\beta_{\mathrm{DS}}) \end{align*} gives \begin{align*} \left\|\frac{X^\top X\Delta_{\mathrm{DS}}}{n}\right\|_\infty = \left\|\frac{X^\top(y-X\beta^*)}{n} - \frac{X^\top(y-X\hat\beta_{\mathrm{DS}})}{n} \right\|_\infty. \end{align*} Applying the triangle inequality in $\ell^\infty$ and using the score event together with Dantzig feasibility of $\hat\beta_{\mathrm{DS}}$ yields \begin{align*} \left\|\frac{X^\top X\Delta_{\mathrm{DS}}}{n}\right\|_\infty \leq \left\|\frac{X^\top\varepsilon}{n}\right\|_\infty + \left\|\frac{X^\top(y-X\hat\beta_{\mathrm{DS}})}{n}\right\|_\infty \leq \frac{3\lambda}{2}. \end{align*} Therefore, using Hölder's inequality for $(\ell^1,\ell^\infty)$, \begin{align*} \frac{\|X\Delta_{\mathrm{DS}}\|_2^2}{n} = \Delta_{\mathrm{DS}}^\top\frac{X^\top X\Delta_{\mathrm{DS}}}{n} \leq \|\Delta_{\mathrm{DS}}\|_1 \left\|\frac{X^\top X\Delta_{\mathrm{DS}}}{n}\right\|_\infty \leq \frac{3\lambda}{2}\|\Delta_{\mathrm{DS}}\|_1. \end{align*} The cone condition gives \begin{align*} \|\Delta_{\mathrm{DS}}\|_1 \leq 2\|\Delta_{\mathrm{DS},S}\|_1 \leq 2\sqrt{s}\|\Delta_{\mathrm{DS},S}\|_2. \end{align*} Since $\Delta_{\mathrm{DS}}$ lies in the cone $\|\Delta_{\mathrm{DS},S^c}\|_1 \leq \|\Delta_{\mathrm{DS},S}\|_1$, the restricted eigenvalue hypothesis gives \begin{align*} \|\Delta_{\mathrm{DS},S}\|_2 \leq \frac{1}{\kappa}\frac{\|X\Delta_{\mathrm{DS}}\|_2}{\sqrt{n}}. \end{align*} Writing \begin{align*} A_{\mathrm{DS}} := \frac{\|X\Delta_{\mathrm{DS}}\|_2}{\sqrt{n}}, \end{align*} we obtain \begin{align*} A_{\mathrm{DS}}^2 \leq \frac{3\lambda}{2}\cdot 2\sqrt{s}\cdot \frac{A_{\mathrm{DS}}}{\kappa} = \frac{3\lambda\sqrt{s}}{\kappa}A_{\mathrm{DS}}. \end{align*} If $A_{\mathrm{DS}}=0$, the prediction bound is immediate. If $A_{\mathrm{DS}}>0$, division by $A_{\mathrm{DS}}$ gives \begin{align*} A_{\mathrm{DS}} \leq \frac{3\lambda\sqrt{s}}{\kappa}, \end{align*} so \begin{align*} \frac{\|X\Delta_{\mathrm{DS}}\|_2^2}{n} = A_{\mathrm{DS}}^2 \leq \frac{9\lambda^2s}{\kappa^2}. \end{align*} The $\ell^1$ estimate follows from the preceding cone and restricted eigenvalue bounds: \begin{align*} \|\Delta_{\mathrm{DS}}\|_1 \leq \frac{2\sqrt{s}}{\kappa}A_{\mathrm{DS}} \leq \frac{6\lambda s}{\kappa^2} \leq \frac{12\lambda s}{\kappa^2}. \end{align*} [/step] [step:Combine the two estimator-specific estimates] For the Lasso error $\Delta_{\mathrm{L}}=\hat\beta_{\mathrm{L}}-\beta^*$, we proved \begin{align*} \frac{\|X\Delta_{\mathrm{L}}\|_2^2}{n} \leq \frac{9\lambda^2s}{\kappa^2}, \qquad \|\Delta_{\mathrm{L}}\|_1 \leq \frac{12\lambda s}{\kappa^2}. \end{align*} For the Dantzig selector error $\Delta_{\mathrm{DS}}=\hat\beta_{\mathrm{DS}}-\beta^*$, we proved \begin{align*} \frac{\|X\Delta_{\mathrm{DS}}\|_2^2}{n} \leq \frac{9\lambda^2s}{\kappa^2}, \qquad \|\Delta_{\mathrm{DS}}\|_1 \leq \frac{12\lambda s}{\kappa^2}. \end{align*} Thus the same prediction and $\ell^1$ rates hold for each $\hat\beta \in \{\hat\beta_{\mathrm{L}},\hat\beta_{\mathrm{DS}}\}$, which is the desired comparison. [/step]

Prerequisites (0/4 completed)

Prerequisites Graph

Interactive dependency map showing how this theorem builds on foundational concepts

Loading dependency graph...

Theorems

Definitions & Concepts

Event

Explore Further

Event Definition Triangle Inequality For Inner Product Spaces Theorem #433 Reverse Triangle Inequality Theorem #2300 Lasso Basic Inequality Theorem #5555 Sum Law by Convolution Probability Theory Weak Law of Large Numbers Probability Theory Independence of Fitted Values and Residuals in the Normal Linear Model Probability & Statistics $L^2$ Integrability Implies $L^1$ Integrability Probability Theory Conditional Expectations are Uniformly Integrable Martingale Theory Likelihood-Ratio Deviance Comparison Theorem for Nested GLMs Probability & Statistics Variance Inflation Factor Formula Probability & Statistics Wald, Likelihood-Ratio, and Score Tests in Regular Generalized Linear Models Probability & Statistics Probability & Statistics Area

What brings you to Androma?

Start with a route through the knowledge graph.

Comparison of Lasso and Dantzig Selector Rates (Theorem # 5577)

Discussion

Proof

Prerequisites (0/4 completed)

Prerequisites Graph

Explore Further

Sign in to Androma

Check your inbox

One last step

Comparison of Lasso and Dantzig Selector Rates (Theorem # 5577)

Discussion

Proof

Prerequisites (0/4 completed)

Prerequisites Graph

Explore Further