Dantzig Selector Oracle Inequality under a Restricted Eigenvalue Condition

Theorem

Edit Issues Pull Requests Attributions Admin

Discussion

Proof

[proofplan] The proof has three ingredients. First, optimality of the Dantzig selector and the support condition on $\beta^*$ force the error vector $\Delta$ into the cone $\mathcal{C}(S)$. Second, feasibility of both $\hat{\beta}_{\mathrm{DS}}$ and $\beta^*$ controls the empirical score $\|X^\top X\Delta/n\|_\infty$ by $2\lambda$. Finally, the cone condition and the restricted eigenvalue condition convert this score bound into a prediction estimate, and substituting that estimate back gives the $\ell^1$ estimate. [/proofplan] [step:Use Dantzig optimality to place the error in the cone] Since $\beta^* \in \mathcal{D}_\lambda$ and $\hat{\beta}_{\mathrm{DS}}$ minimizes the $\ell^1$ norm over $\mathcal{D}_\lambda$, we have \begin{align*} \|\hat{\beta}_{\mathrm{DS}}\|_1 \leq \|\beta^*\|_1. \end{align*} Using $\hat{\beta}_{\mathrm{DS}} = \beta^* + \Delta$ and $\beta^*_{S^c}=0$, this gives \begin{align*} \|\beta^*_S + \Delta_S\|_1 + \|\Delta_{S^c}\|_1 = \|\beta^* + \Delta\|_1 \leq \|\beta^*\|_1 = \|\beta^*_S\|_1. \end{align*} By the [Reverse Triangle Inequality](/page/Reverse%20Triangle%20Inequality) for the $\ell^1$ norm on $\mathbb{R}^p$, \begin{align*} \|\beta^*_S + \Delta_S\|_1 \geq \|\beta^*_S\|_1 - \|\Delta_S\|_1. \end{align*} Combining the last two displays yields \begin{align*} \|\Delta_{S^c}\|_1 \leq \|\Delta_S\|_1. \end{align*} Thus $\Delta \in \mathcal{C}(S)$. If $s=0$, then $S=\varnothing$, so the cone condition gives \begin{align*} \|\Delta\|_1=\|\Delta_{S^c}\|_1 \leq \|\Delta_S\|_1=0. \end{align*} Hence $\Delta=0$, and both asserted inequalities hold because $\kappa(S,1)>0$ makes the displayed denominators positive and the right-hand sides equal $0$. For the remaining steps, assume $s\geq 1$. [guided] The first point is that the Dantzig selector is not merely feasible; it is an $\ell^1$-minimizer among all feasible vectors. Since $\beta^*$ is feasible by hypothesis, the minimizing property gives \begin{align*} \|\hat{\beta}_{\mathrm{DS}}\|_1 \leq \|\beta^*\|_1. \end{align*} We now rewrite this inequality in terms of the error vector $\Delta := \hat{\beta}_{\mathrm{DS}}-\beta^*$. Because $S=\operatorname{supp}(\beta^*)$, the vector $\beta^*$ has no coordinates outside $S$, so $\beta^*_{S^c}=0$. Therefore \begin{align*} \|\hat{\beta}_{\mathrm{DS}}\|_1 = \|\beta^*+\Delta\|_1 = \|\beta^*_S+\Delta_S\|_1+\|\Delta_{S^c}\|_1. \end{align*} The right-hand side of the optimality inequality is \begin{align*} \|\beta^*\|_1=\|\beta^*_S\|_1. \end{align*} Hence \begin{align*} \|\beta^*_S+\Delta_S\|_1+\|\Delta_{S^c}\|_1 \leq \|\beta^*_S\|_1. \end{align*} The only remaining issue is to isolate $\|\Delta_{S^c}\|_1$. The [Reverse Triangle Inequality](/page/Reverse%20Triangle%20Inequality) for the $\ell^1$ norm on $\mathbb{R}^p$ gives \begin{align*} \|\beta^*_S+\Delta_S\|_1 \geq \|\beta^*_S\|_1-\|\Delta_S\|_1. \end{align*} Substituting this lower bound into the previous inequality gives \begin{align*} \|\beta^*_S\|_1-\|\Delta_S\|_1+\|\Delta_{S^c}\|_1 \leq \|\beta^*_S\|_1. \end{align*} After subtracting $\|\beta^*_S\|_1$ from both sides, we obtain \begin{align*} \|\Delta_{S^c}\|_1 \leq \|\Delta_S\|_1. \end{align*} This is exactly the cone condition $\Delta \in \mathcal{C}(S)$. We also handle the edge case in which the support is empty. If $s=0$, then $S=\varnothing$, and the cone condition reads \begin{align*} \|\Delta\|_1=\|\Delta_{S^c}\|_1 \leq \|\Delta_S\|_1=0. \end{align*} Therefore $\Delta=0$. Since $\kappa(S,1)>0$, the denominators in the two claimed right-hand sides are positive; since $s=0$, both right-hand sides are $0$. Thus both desired inequalities hold in this case. From now on, we may assume $s\geq 1$. [/guided] [/step] [step:Use feasibility of both vectors to bound the empirical score] Define the residual map $R: \mathbb{R}^p \to \mathbb{R}^n$ by declaring that, for each $\beta \in \mathbb{R}^p$, the value $R(\beta) \in \mathbb{R}^n$ is \begin{align*} R(\beta)=y-X\beta. \end{align*} Since $\hat{\beta}_{\mathrm{DS}},\beta^* \in \mathcal{D}_\lambda$, we have \begin{align*} \left\|\frac{X^\top R(\hat{\beta}_{\mathrm{DS}})}{n}\right\|_\infty \leq \lambda, \qquad \left\|\frac{X^\top R(\beta^*)}{n}\right\|_\infty \leq \lambda. \end{align*} Because \begin{align*} R(\beta^*)-R(\hat{\beta}_{\mathrm{DS}}) = y-X\beta^* - y + X\hat{\beta}_{\mathrm{DS}} = X\Delta, \end{align*} we first obtain \begin{align*} \left\|\frac{X^\top X\Delta}{n}\right\|_\infty = \left\|\frac{X^\top R(\beta^*)}{n}-\frac{X^\top R(\hat{\beta}_{\mathrm{DS}})}{n}\right\|_\infty. \end{align*} Applying the triangle inequality for the maximum norm $\|\cdot\|_\infty$ on $\mathbb{R}^p$ gives \begin{align*} \left\|\frac{X^\top X\Delta}{n}\right\|_\infty \leq \left\|\frac{X^\top R(\beta^*)}{n}\right\|_\infty + \left\|\frac{X^\top R(\hat{\beta}_{\mathrm{DS}})}{n}\right\|_\infty. \end{align*} Using the two feasibility bounds yields \begin{align*} \left\|\frac{X^\top X\Delta}{n}\right\|_\infty \leq 2\lambda. \end{align*} [guided] We next translate feasibility into a bound on the empirical score of the error vector. Define the residual map $R: \mathbb{R}^p \to \mathbb{R}^n$ by declaring that, for each $\beta \in \mathbb{R}^p$, the value $R(\beta) \in \mathbb{R}^n$ is \begin{align*} R(\beta)=y-X\beta. \end{align*} The definition of feasibility in the Dantzig set $\mathcal{D}_\lambda$ gives the two bounds \begin{align*} \left\|\frac{X^\top R(\hat{\beta}_{\mathrm{DS}})}{n}\right\|_\infty \leq \lambda, \qquad \left\|\frac{X^\top R(\beta^*)}{n}\right\|_\infty \leq \lambda. \end{align*} The key point is that the difference of these residuals is exactly the fitted-value error. Since $\Delta=\hat{\beta}_{\mathrm{DS}}-\beta^*$, expanding the two residuals gives \begin{align*} R(\beta^*)-R(\hat{\beta}_{\mathrm{DS}})=y-X\beta^* - y + X\hat{\beta}_{\mathrm{DS}}. \end{align*} Cancelling the two occurrences of $y$ and factoring out $X$ gives \begin{align*} R(\beta^*)-R(\hat{\beta}_{\mathrm{DS}})=X(\hat{\beta}_{\mathrm{DS}}-\beta^*)=X\Delta. \end{align*} Multiplying by $X^\top/n$ gives \begin{align*} \frac{X^\top X\Delta}{n} = \frac{X^\top R(\beta^*)}{n}-\frac{X^\top R(\hat{\beta}_{\mathrm{DS}})}{n}. \end{align*} Applying the triangle inequality for the maximum norm $\|\cdot\|_\infty$ on $\mathbb{R}^p$ gives \begin{align*} \left\|\frac{X^\top X\Delta}{n}\right\|_\infty \leq \left\|\frac{X^\top R(\beta^*)}{n}\right\|_\infty + \left\|\frac{X^\top R(\hat{\beta}_{\mathrm{DS}})}{n}\right\|_\infty. \end{align*} Substituting the two feasibility estimates yields \begin{align*} \left\|\frac{X^\top X\Delta}{n}\right\|_\infty \leq \lambda+\lambda=2\lambda. \end{align*} [/guided] [/step] [step:Convert the score bound into a prediction bound] The Euclidean identity for the matrix $X$ and vector $\Delta$ gives \begin{align*} \frac{\|X\Delta\|_2^2}{n} = \frac{\Delta^\top X^\top X\Delta}{n}. \end{align*} Applying [Hölder's Inequality](/page/Holder%20Inequality) for the conjugate pair $(1,\infty)$ to the vectors $\Delta$ and $X^\top X\Delta/n$ in $\mathbb{R}^p$, and then using the score bound, yields \begin{align*} \frac{\|X\Delta\|_2^2}{n} \leq \|\Delta\|_1\left\|\frac{X^\top X\Delta}{n}\right\|_\infty \leq 2\lambda\|\Delta\|_1. \end{align*} Since $\Delta \in \mathcal{C}(S)$, \begin{align*} \|\Delta\|_1 = \|\Delta_S\|_1+\|\Delta_{S^c}\|_1 \leq 2\|\Delta_S\|_1. \end{align*} Enumerate the finite set $S$ as $S=\{j_1,\dots,j_s\}$, and regard $\Delta_S$ as the vector $(\Delta_{j_1},\dots,\Delta_{j_s}) \in \mathbb{R}^s$. By the [Cauchy-Schwarz Inequality](/page/Cauchy-Schwarz%20Inequality) in $\mathbb{R}^s$, \begin{align*} \|\Delta_S\|_1 \leq \sqrt{s}\,\|\Delta_S\|_2. \end{align*} Thus \begin{align*} \|\Delta\|_1 \leq 2\sqrt{s}\,\|\Delta_S\|_2. \end{align*} If $\Delta_S=0$, the cone condition gives $\|\Delta_{S^c}\|_1=0$, hence $\Delta=0$, and both claimed inequalities follow. Assume therefore that $\Delta_S \neq 0$. Since $\Delta \in \mathcal{C}(S)$, the definition of $\kappa(S,1)$ gives \begin{align*} \kappa(S,1) \leq \frac{\|X\Delta\|_2}{\sqrt{n}\,\|\Delta_S\|_2}, \end{align*} or equivalently \begin{align*} \|\Delta_S\|_2 \leq \frac{\|X\Delta\|_2}{\sqrt{n}\,\kappa(S,1)}. \end{align*} Combining the preceding estimates gives \begin{align*} \frac{\|X\Delta\|_2^2}{n} \leq 2\lambda\|\Delta\|_1 \leq 4\lambda\sqrt{s}\,\|\Delta_S\|_2 \leq \frac{4\lambda\sqrt{s}\,\|X\Delta\|_2}{\sqrt{n}\,\kappa(S,1)}. \end{align*} If $\|X\Delta\|_2=0$, the desired prediction bound is immediate. If $\|X\Delta\|_2>0$, divide by $\|X\Delta\|_2/\sqrt{n}$ to obtain \begin{align*} \frac{\|X\Delta\|_2}{\sqrt{n}} \leq \frac{4\lambda\sqrt{s}}{\kappa(S,1)}. \end{align*} Squaring both sides gives \begin{align*} \frac{\|X\Delta\|_2^2}{n} \leq \frac{16\lambda^2 s}{\kappa(S,1)^2}. \end{align*} [guided] We now combine the score control with the restricted eigenvalue condition. First, the Euclidean identity for the matrix $X$ and vector $\Delta$ gives \begin{align*} \frac{\|X\Delta\|_2^2}{n} = \frac{\Delta^\top X^\top X\Delta}{n}. \end{align*} Apply [Hölder's Inequality](/page/Holder%20Inequality) with conjugate exponents $(1,\infty)$ to the two vectors $\Delta \in \mathbb{R}^p$ and $X^\top X\Delta/n \in \mathbb{R}^p$. This gives \begin{align*} \frac{\|X\Delta\|_2^2}{n} \leq \|\Delta\|_1\left\|\frac{X^\top X\Delta}{n}\right\|_\infty. \end{align*} Using the score bound from the previous step, we obtain \begin{align*} \frac{\|X\Delta\|_2^2}{n} \leq 2\lambda\|\Delta\|_1. \end{align*} The cone condition now reduces the full $\ell^1$ norm to the active coordinates: \begin{align*} \|\Delta\|_1 = \|\Delta_S\|_1+\|\Delta_{S^c}\|_1 \leq 2\|\Delta_S\|_1. \end{align*} Enumerate the finite set $S$ as $S=\{j_1,\dots,j_s\}$, and regard $\Delta_S$ as the vector $(\Delta_{j_1},\dots,\Delta_{j_s}) \in \mathbb{R}^s$. Applying the [Cauchy-Schwarz Inequality](/page/Cauchy-Schwarz%20Inequality) in $\mathbb{R}^s$ gives \begin{align*} \|\Delta_S\|_1 \leq \sqrt{s}\,\|\Delta_S\|_2. \end{align*} Therefore \begin{align*} \|\Delta\|_1 \leq 2\sqrt{s}\,\|\Delta_S\|_2. \end{align*} If $\Delta_S=0$, the cone condition gives $\|\Delta_{S^c}\|_1=0$, hence $\Delta=0$, and the prediction inequality follows. Assume $\Delta_S\neq 0$. Since $\Delta\in\mathcal{C}(S)$, the definition of the restricted eigenvalue constant $\kappa(S,1)$ gives \begin{align*} \kappa(S,1) \leq \frac{\|X\Delta\|_2}{\sqrt{n}\,\|\Delta_S\|_2}, \end{align*} so \begin{align*} \|\Delta_S\|_2 \leq \frac{\|X\Delta\|_2}{\sqrt{n}\,\kappa(S,1)}. \end{align*} Combining the estimates yields \begin{align*} \frac{\|X\Delta\|_2^2}{n} \leq 2\lambda\|\Delta\|_1 \leq 4\lambda\sqrt{s}\,\|\Delta_S\|_2 \leq \frac{4\lambda\sqrt{s}\,\|X\Delta\|_2}{\sqrt{n}\,\kappa(S,1)}. \end{align*} If $\|X\Delta\|_2=0$, the desired prediction bound follows. If $\|X\Delta\|_2>0$, divide by $\|X\Delta\|_2/\sqrt{n}$ to obtain \begin{align*} \frac{\|X\Delta\|_2}{\sqrt{n}} \leq \frac{4\lambda\sqrt{s}}{\kappa(S,1)}. \end{align*} Squaring both sides gives \begin{align*} \frac{\|X\Delta\|_2^2}{n} \leq \frac{16\lambda^2 s}{\kappa(S,1)^2}. \end{align*} [/guided] [/step] [step:Substitute the prediction estimate to obtain the $\ell^1$ estimate] From the cone condition and the finite-dimensional Cauchy-Schwarz estimate on the support coordinates, \begin{align*} \|\Delta\|_1 \leq 2\sqrt{s}\,\|\Delta_S\|_2. \end{align*} If $\Delta_S=0$, then the cone condition gives $\|\Delta_{S^c}\|_1=0$, hence $\Delta=0$, and the claimed $\ell^1$ estimate follows. Assume therefore that $\Delta_S\neq 0$. Since $\Delta\in\mathcal{C}(S)$, the definition of $\kappa(S,1)$ gives \begin{align*} \|\Delta_S\|_2 \leq \frac{\|X\Delta\|_2}{\sqrt{n}\,\kappa(S,1)}. \end{align*} Combining the last two displays gives \begin{align*} \|\Delta\|_1 \leq \frac{2\sqrt{s}\,\|X\Delta\|_2}{\sqrt{n}\,\kappa(S,1)}. \end{align*} Using the prediction estimate \begin{align*} \frac{\|X\Delta\|_2}{\sqrt{n}} \leq \frac{4\lambda\sqrt{s}}{\kappa(S,1)}, \end{align*} we obtain \begin{align*} \|\Delta\|_1 \leq \frac{2\sqrt{s}}{\kappa(S,1)} \cdot \frac{4\lambda\sqrt{s}}{\kappa(S,1)} = \frac{8\lambda s}{\kappa(S,1)^2}. \end{align*} Together with the prediction bound, this proves both asserted inequalities. [guided] The prediction estimate already proved can now be fed back into the cone and restricted eigenvalue estimates to control $\|\Delta\|_1$. From the cone condition and the finite-dimensional Cauchy-Schwarz estimate on the support coordinates, we have \begin{align*} \|\Delta\|_1 \leq 2\sqrt{s}\,\|\Delta_S\|_2. \end{align*} If $\Delta_S=0$, then this already gives $\|\Delta\|_1=0$, so the desired $\ell^1$ bound holds. If $\Delta_S\neq 0$, the restricted eigenvalue estimate gives \begin{align*} \|\Delta_S\|_2 \leq \frac{\|X\Delta\|_2}{\sqrt{n}\,\kappa(S,1)}. \end{align*} Therefore \begin{align*} \|\Delta\|_1 \leq \frac{2\sqrt{s}\,\|X\Delta\|_2}{\sqrt{n}\,\kappa(S,1)}. \end{align*} The prediction bound gives \begin{align*} \frac{\|X\Delta\|_2}{\sqrt{n}} \leq \frac{4\lambda\sqrt{s}}{\kappa(S,1)}. \end{align*} Substituting this into the preceding display yields \begin{align*} \|\Delta\|_1 \leq \frac{2\sqrt{s}}{\kappa(S,1)} \cdot \frac{4\lambda\sqrt{s}}{\kappa(S,1)} = \frac{8\lambda s}{\kappa(S,1)^2}. \end{align*} Together with the prediction bound, this proves both asserted inequalities. [/guided] [/step]

Prerequisites (0/1 completed)

Prerequisites Graph

Interactive dependency map showing how this theorem builds on foundational concepts

Loading dependency graph...

Theorems

Triangle Inequality For Inner Product Spaces

What brings you to Androma?

Start with a route through the knowledge graph.