Fast Lasso Prediction Bound under the Compatibility Condition

Theorem

Edit Issues Pull Requests Attributions Admin

Discussion

Proof

[proofplan] Set $\Delta := \hat{\beta}-\beta^*$ and use the optimality of $\hat{\beta}$ in the Lasso objective to obtain the basic inequality. On the score event, the stochastic [inner product](/page/Inner%20Product) is controlled by $\frac{\lambda}{2}\|\Delta\|_1$, and the support decomposition of $\beta^*$ sharpens the inequality into a cone inequality and a prediction-error inequality. The cone condition permits the compatibility constant to control $\|\Delta_S\|_1$ by $|X\Delta|/\sqrt n$, and solving the resulting scalar quadratic inequality gives the desired fast prediction bound. [/proofplan] [step:Derive the sharpened basic inequality from Lasso optimality] Define the estimation error vector $\Delta \in \mathbb{R}^p$ by \begin{align*} \Delta := \hat{\beta}-\beta^*. \end{align*} Throughout the proof, $|\cdot|$ denotes the Euclidean norm on the relevant finite-dimensional Euclidean space, and $\|\cdot\|_1$ and $\|\cdot\|_\infty$ denote the usual $\ell^1$ and $\ell^\infty$ norms on $\mathbb{R}^p$. Since $\hat{\beta}$ minimizes the Lasso objective over $\mathbb{R}^p$, comparison with $\beta^*$ gives \begin{align*} \frac{1}{2n}|Y-X\hat{\beta}|^2+\lambda\|\hat{\beta}\|_1 \leq \frac{1}{2n}|Y-X\beta^*|^2+\lambda\|\beta^*\|_1. \end{align*} Using $Y=X\beta^*+\varepsilon$ and $X\hat{\beta}=X\beta^*+X\Delta$, this becomes \begin{align*} \frac{1}{2n}|\varepsilon-X\Delta|^2+\lambda\|\beta^*+\Delta\|_1 \leq \frac{1}{2n}|\varepsilon|^2+\lambda\|\beta^*\|_1. \end{align*} Expanding the squared Euclidean norm, \begin{align*} |\varepsilon-X\Delta|^2 = |\varepsilon|^2-2\varepsilon^\top X\Delta+|X\Delta|^2, \end{align*} and cancelling $\frac{1}{2n}|\varepsilon|^2$ yields \begin{align*} \frac{1}{2n}|X\Delta|^2 \leq \frac{1}{n}\varepsilon^\top X\Delta + \lambda\bigl(\|\beta^*\|_1-\|\beta^*+\Delta\|_1\bigr). \end{align*} On $\mathcal{E}_\lambda$, the finite-dimensional dual norm inequality for the pair $(\ell^\infty,\ell^1)$ gives \begin{align*} \frac{1}{n}\varepsilon^\top X\Delta = \left(\frac{1}{n}X^\top\varepsilon\right)^\top \Delta \leq \frac{1}{n}\|X^\top\varepsilon\|_\infty\|\Delta\|_1 \leq \frac{\lambda}{2}\|\Delta\|_1. \end{align*} Since $\beta^*_{S^c}=0$, decomposing the $\ell^1$ norm over $S$ and $S^c$ gives \begin{align*} \|\beta^*\|_1-\|\beta^*+\Delta\|_1 = \|\beta^*_S\|_1-\|\beta^*_S+\Delta_S\|_1-\|\Delta_{S^c}\|_1. \end{align*} The [reverse triangle inequality](/theorems/2300) in $\ell^1(S)$ gives $\|\beta^*_S\|_1-\|\beta^*_S+\Delta_S\|_1 \leq \|\Delta_S\|_1$, hence \begin{align*} \|\beta^*\|_1-\|\beta^*+\Delta\|_1 \leq \|\Delta_S\|_1-\|\Delta_{S^c}\|_1. \end{align*} Combining these estimates and using $\|\Delta\|_1=\|\Delta_S\|_1+\|\Delta_{S^c}\|_1$, the combined right-hand side is \begin{align*} \frac{\lambda}{2}\bigl(\|\Delta_S\|_1+\|\Delta_{S^c}\|_1\bigr) + \lambda\bigl(\|\Delta_S\|_1-\|\Delta_{S^c}\|_1\bigr) = \frac{3\lambda}{2}\|\Delta_S\|_1-\frac{\lambda}{2}\|\Delta_{S^c}\|_1. \end{align*} Therefore \begin{align*} \frac{1}{2n}|X\Delta|^2 \leq \frac{3\lambda}{2}\|\Delta_S\|_1-\frac{\lambda}{2}\|\Delta_{S^c}\|_1. \end{align*} Equivalently, \begin{align*} \frac{1}{2n}|X\Delta|^2+\frac{\lambda}{2}\|\Delta_{S^c}\|_1 \leq \frac{3\lambda}{2}\|\Delta_S\|_1. \end{align*} [guided] The proof starts from the only property of $\hat{\beta}$ that is needed: it has objective value no larger than the value at the true coefficient vector $\beta^*$. Define the error vector $\Delta \in \mathbb{R}^p$ by \begin{align*} \Delta := \hat{\beta}-\beta^*. \end{align*} In this proof, $|\cdot|$ denotes the Euclidean norm on the relevant finite-dimensional Euclidean space, while $\|\cdot\|_1$ and $\|\cdot\|_\infty$ denote the usual $\ell^1$ and $\ell^\infty$ norms on $\mathbb{R}^p$. Thus $|Y-X\hat{\beta}|$ and $|X\Delta|$ are Euclidean prediction norms, whereas $\|\hat{\beta}\|_1$ is the Lasso penalty norm. The Lasso optimality condition gives \begin{align*} \frac{1}{2n}|Y-X\hat{\beta}|^2+\lambda\|\hat{\beta}\|_1 \leq \frac{1}{2n}|Y-X\beta^*|^2+\lambda\|\beta^*\|_1. \end{align*} Because the model is $Y=X\beta^*+\varepsilon$, the two residuals are \begin{align*} Y-X\hat{\beta} = \varepsilon-X\Delta, \qquad Y-X\beta^* = \varepsilon. \end{align*} Substituting these identities into the optimality inequality gives \begin{align*} \frac{1}{2n}|\varepsilon-X\Delta|^2+\lambda\|\beta^*+\Delta\|_1 \leq \frac{1}{2n}|\varepsilon|^2+\lambda\|\beta^*\|_1. \end{align*} Now expand the squared Euclidean norm: \begin{align*} |\varepsilon-X\Delta|^2 = |\varepsilon|^2-2\varepsilon^\top X\Delta+|X\Delta|^2. \end{align*} After cancelling $\frac{1}{2n}|\varepsilon|^2$ from both sides, we obtain \begin{align*} \frac{1}{2n}|X\Delta|^2 \leq \frac{1}{n}\varepsilon^\top X\Delta + \lambda\bigl(\|\beta^*\|_1-\|\beta^*+\Delta\|_1\bigr). \end{align*} The score event is designed exactly to control the stochastic inner product. The finite-dimensional [dual norm](/page/Dual%20Norm) inequality says that for vectors $a,b \in \mathbb{R}^p$, $a^\top b \leq \|a\|_\infty\|b\|_1$. Applying this with $a=\frac{1}{n}X^\top\varepsilon$ and $b=\Delta$ gives \begin{align*} \frac{1}{n}\varepsilon^\top X\Delta = \left(\frac{1}{n}X^\top\varepsilon\right)^\top\Delta \leq \frac{1}{n}\|X^\top\varepsilon\|_\infty\|\Delta\|_1. \end{align*} On $\mathcal{E}_\lambda$, the factor $\frac{1}{n}\|X^\top\varepsilon\|_\infty$ is at most $\lambda/2$, hence \begin{align*} \frac{1}{n}\varepsilon^\top X\Delta \leq \frac{\lambda}{2}\|\Delta\|_1. \end{align*} It remains to exploit the sparsity of $\beta^*$. Since $S=\operatorname{supp}(\beta^*)$, we have $\beta^*_{S^c}=0$. Therefore \begin{align*} \|\beta^*+\Delta\|_1 = \|\beta^*_S+\Delta_S\|_1+\|\Delta_{S^c}\|_1. \end{align*} The [reverse triangle inequality](/page/Triangle%20Inequality) in $\ell^1(S)$ gives \begin{align*} \|\beta^*_S+\Delta_S\|_1 \geq \|\beta^*_S\|_1-\|\Delta_S\|_1. \end{align*} Therefore \begin{align*} \|\beta^*_S\|_1-\|\beta^*_S+\Delta_S\|_1 \leq \|\Delta_S\|_1. \end{align*} Combining this with the support decomposition gives \begin{align*} \|\beta^*\|_1-\|\beta^*+\Delta\|_1 \leq \|\Delta_S\|_1-\|\Delta_{S^c}\|_1. \end{align*} Substituting the stochastic bound and this support decomposition into the basic inequality gives the next estimate. First, using $\|\Delta\|_1=\|\Delta_S\|_1+\|\Delta_{S^c}\|_1$, the right-hand side becomes \begin{align*} \frac{\lambda}{2}\|\Delta\|_1 + \lambda\bigl(\|\Delta_S\|_1-\|\Delta_{S^c}\|_1\bigr) = \frac{\lambda}{2}\bigl(\|\Delta_S\|_1+\|\Delta_{S^c}\|_1\bigr) + \lambda\bigl(\|\Delta_S\|_1-\|\Delta_{S^c}\|_1\bigr). \end{align*} Collecting the coefficients of $\|\Delta_S\|_1$ and $\|\Delta_{S^c}\|_1$ gives \begin{align*} \frac{\lambda}{2}\bigl(\|\Delta_S\|_1+\|\Delta_{S^c}\|_1\bigr) + \lambda\bigl(\|\Delta_S\|_1-\|\Delta_{S^c}\|_1\bigr) = \frac{3\lambda}{2}\|\Delta_S\|_1-\frac{\lambda}{2}\|\Delta_{S^c}\|_1. \end{align*} Hence \begin{align*} \frac{1}{2n}|X\Delta|^2 \leq \frac{3\lambda}{2}\|\Delta_S\|_1-\frac{\lambda}{2}\|\Delta_{S^c}\|_1. \end{align*} Moving the non-positive term on the right to the left gives the sharpened basic inequality \begin{align*} \frac{1}{2n}|X\Delta|^2+\frac{\lambda}{2}\|\Delta_{S^c}\|_1 \leq \frac{3\lambda}{2}\|\Delta_S\|_1. \end{align*} This inequality is the point where the noise control and the sparsity of $\beta^*$ meet. [/guided] [/step] [step:Use the sharpened inequality to put the error in the compatibility cone] From \begin{align*} \frac{1}{2n}|X\Delta|^2+\frac{\lambda}{2}\|\Delta_{S^c}\|_1 \leq \frac{3\lambda}{2}\|\Delta_S\|_1 \end{align*} and the non-negativity of $\frac{1}{2n}|X\Delta|^2$, we obtain \begin{align*} \|\Delta_{S^c}\|_1 \leq 3\|\Delta_S\|_1. \end{align*} Thus $\Delta$ belongs to the cone appearing in the definition of the [compatibility constant](/page/Compatibility%20Constant) $\phi_{\mathrm{comp}}(S,3)$ whenever $\Delta_S \neq 0$. If $\Delta_S=0$, the same inequality gives $\|\Delta_{S^c}\|_1=0$, hence $\Delta=0$, and the desired prediction bound follows immediately. Therefore, for the remaining argument, assume $\Delta_S\neq 0$. [guided] Starting from the sharpened basic inequality, \begin{align*} \frac{1}{2n}|X\Delta|^2+\frac{\lambda}{2}\|\Delta_{S^c}\|_1 \leq \frac{3\lambda}{2}\|\Delta_S\|_1, \end{align*} we isolate the part that forces the Lasso error into the compatibility cone. The term $\frac{1}{2n}|X\Delta|^2$ is non-negative because it is a scalar multiple of a squared Euclidean norm. Dropping this non-negative term from the left-hand side gives the weaker inequality \begin{align*} \frac{\lambda}{2}\|\Delta_{S^c}\|_1 \leq \frac{3\lambda}{2}\|\Delta_S\|_1. \end{align*} Since $\lambda>0$ by hypothesis, division by $\lambda/2$ gives \begin{align*} \|\Delta_{S^c}\|_1 \leq 3\|\Delta_S\|_1. \end{align*} This is exactly the cone condition appearing in the definition of the [compatibility constant](/page/Compatibility%20Constant) $\phi_{\mathrm{comp}}(S,3)$, provided $\Delta_S\neq 0$. There is one degenerate case to remove before applying compatibility. If $\Delta_S=0$, then the cone inequality becomes \begin{align*} \|\Delta_{S^c}\|_1 \leq 0. \end{align*} Because an $\ell^1$ norm is non-negative, this implies $\|\Delta_{S^c}\|_1=0$, hence $\Delta_{S^c}=0$. Together with $\Delta_S=0$, this gives $\Delta=0$. Therefore $X(\hat\beta-\beta^*)=X\Delta=0$, and the theorem's prediction bound follows because the right-hand side is non-negative. Thus, in the remaining case, we may assume $\Delta_S\neq 0$ and $\Delta$ lies in the compatibility cone. [/guided] [/step] [step:Apply the compatibility constant to control the active coordinates] Since $\Delta_S\neq 0$ and $\|\Delta_{S^c}\|_1\leq 3\|\Delta_S\|_1$, the definition of the [compatibility constant](/page/Compatibility%20Constant) $\phi_{\mathrm{comp}}(S,3)$ gives \begin{align*} \phi_{\mathrm{comp}}(S,3) \leq \frac{\sqrt{s}|X\Delta|}{\sqrt{n}\|\Delta_S\|_1}. \end{align*} Because $\phi_{\mathrm{comp}}(S,3)>0$, rearranging yields \begin{align*} \|\Delta_S\|_1 \leq \frac{\sqrt{s}|X\Delta|}{\sqrt{n}\phi_{\mathrm{comp}}(S,3)}. \end{align*} Discarding the non-negative term $\frac{\lambda}{2}\|\Delta_{S^c}\|_1$ from the sharpened basic inequality gives \begin{align*} \frac{1}{2n}|X\Delta|^2 \leq \frac{3\lambda}{2}\|\Delta_S\|_1. \end{align*} Substituting the compatibility bound gives \begin{align*} \frac{1}{2n}|X\Delta|^2 \leq \frac{3\lambda}{2} \frac{\sqrt{s}|X\Delta|}{\sqrt{n}\phi_{\mathrm{comp}}(S,3)}. \end{align*} [guided] Now the purpose of the cone condition becomes visible: it allows the [compatibility constant](/page/Compatibility%20Constant) to convert the active-coordinate quantity $\|\Delta_S\|_1$ into the prediction norm $|X\Delta|/\sqrt n$. We have already verified the two hypotheses needed to use the definition: first, $\Delta_S\neq 0$; second, \begin{align*} \|\Delta_{S^c}\|_1 \leq 3\|\Delta_S\|_1. \end{align*} Therefore the definition of $\phi_{\mathrm{comp}}(S,3)$ gives \begin{align*} \phi_{\mathrm{comp}}(S,3) \leq \frac{\sqrt{s}|X\Delta|}{\sqrt{n}\|\Delta_S\|_1}. \end{align*} The theorem assumes $\phi_{\mathrm{comp}}(S,3)>0$, and $\|\Delta_S\|_1>0$ follows from $\Delta_S\neq 0$. Hence we may multiply by $\|\Delta_S\|_1$ and divide by $\phi_{\mathrm{comp}}(S,3)$ to obtain \begin{align*} \|\Delta_S\|_1 \leq \frac{\sqrt{s}|X\Delta|}{\sqrt{n}\phi_{\mathrm{comp}}(S,3)}. \end{align*} Return to the sharpened basic inequality, \begin{align*} \frac{1}{2n}|X\Delta|^2+\frac{\lambda}{2}\|\Delta_{S^c}\|_1 \leq \frac{3\lambda}{2}\|\Delta_S\|_1. \end{align*} Since $\frac{\lambda}{2}\|\Delta_{S^c}\|_1\geq 0$, discarding it from the left-hand side gives the weaker estimate \begin{align*} \frac{1}{2n}|X\Delta|^2 \leq \frac{3\lambda}{2}\|\Delta_S\|_1. \end{align*} Substituting the compatibility bound for $\|\Delta_S\|_1$ yields \begin{align*} \frac{1}{2n}|X\Delta|^2 \leq \frac{3\lambda}{2} \frac{\sqrt{s}|X\Delta|}{\sqrt{n}\phi_{\mathrm{comp}}(S,3)}. \end{align*} This reduces the proof to a scalar inequality for the normalized prediction error. [/guided] [/step] [step:Solve the scalar inequality and square the prediction norm] Define the scalar $u \in [0,\infty)$ by \begin{align*} u := \frac{|X\Delta|}{\sqrt{n}}. \end{align*} The preceding inequality becomes \begin{align*} \frac{u^2}{2} \leq \frac{3\lambda\sqrt{s}}{2\phi_{\mathrm{comp}}(S,3)}u. \end{align*} If $u=0$, the desired bound is immediate. If $u>0$, divide both sides by $u/2$ to obtain \begin{align*} u \leq \frac{3\lambda\sqrt{s}}{\phi_{\mathrm{comp}}(S,3)}. \end{align*} Squaring both sides and using $\Delta=\hat{\beta}-\beta^*$ gives \begin{align*} \frac{1}{n}|X(\hat{\beta}-\beta^*)|^2 = u^2 \leq \frac{9\lambda^2s}{\phi_{\mathrm{comp}}(S,3)^2}. \end{align*} This is the claimed fast prediction bound. [guided] Define the normalized prediction-error scalar $u \in [0,\infty)$ by \begin{align*} u := \frac{|X\Delta|}{\sqrt{n}}. \end{align*} The previous step proved \begin{align*} \frac{1}{2n}|X\Delta|^2 \leq \frac{3\lambda}{2} \frac{\sqrt{s}|X\Delta|}{\sqrt{n}\phi_{\mathrm{comp}}(S,3)}. \end{align*} Using the definition of $u$, this is exactly \begin{align*} \frac{u^2}{2} \leq \frac{3\lambda\sqrt{s}}{2\phi_{\mathrm{comp}}(S,3)}u. \end{align*} If $u=0$, then \begin{align*} \frac{1}{n}|X(\hat\beta-\beta^*)|^2 = u^2 = 0, \end{align*} and the desired inequality follows because $9\lambda^2s/\phi_{\mathrm{comp}}(S,3)^2\geq 0$. Assume instead that $u>0$. Then $u/2>0$, so we may divide both sides of the scalar inequality by $u/2$ and obtain \begin{align*} u \leq \frac{3\lambda\sqrt{s}}{\phi_{\mathrm{comp}}(S,3)}. \end{align*} Both sides are non-negative, so squaring preserves the inequality: \begin{align*} u^2 \leq \frac{9\lambda^2s}{\phi_{\mathrm{comp}}(S,3)^2}. \end{align*} Finally, since $\Delta=\hat\beta-\beta^*$, the definition of $u$ gives \begin{align*} \frac{1}{n}|X(\hat\beta-\beta^*)|^2 = \frac{1}{n}|X\Delta|^2 = u^2 \leq \frac{9\lambda^2s}{\phi_{\mathrm{comp}}(S,3)^2}. \end{align*} This proves the theorem in guided mode as well as in exact mode. [/guided] [/step]

Prerequisites (0/2 completed)

Prerequisites Graph

Interactive dependency map showing how this theorem builds on foundational concepts

Loading dependency graph...

Definitions & Concepts

What brings you to Androma?

Start with a route through the knowledge graph.