Oracle Inequality for the Lasso under a Uniform Compatibility Condition

Theorem

Edit Issues Pull Requests Attributions Admin

Discussion

Proof

[proofplan] We compare the Lasso solution $\hat{\beta}$ with an arbitrary sparse vector $\beta$, rather than only with the true parameter $\beta^*$. The optimality inequality gives a deterministic bound involving the stochastic score term and the penalty difference. On the event $\mathcal{E}_\lambda$, the score term is controlled by the $\ell^1$ distance, and the penalty decomposition either already gives the desired prediction bound or places the error vector in the compatibility cone. In the cone case, the compatibility condition converts the active-coordinate $\ell^1$ error into prediction error, and elementary quadratic inequalities absorb the remaining prediction term. [/proofplan] [step:Derive the basic inequality against an arbitrary sparse comparator] Fix $\beta \in \mathbb{R}^p$ with $\|\beta\|_0 \leq m$. Define $d \in \mathbb{R}^p$ by \begin{align*} d := \hat{\beta}-\beta. \end{align*} Define $S \subseteq \{1,\dots,p\}$ by \begin{align*} S := \{j \in \{1,\dots,p\}: \beta_j \neq 0\}. \end{align*} Define $s \in \mathbb{N} \cup \{0\}$ by \begin{align*} s := |S|. \end{align*} Thus $s=\|\beta\|_0 \leq m$. Since $\hat{\beta}$ minimizes the Lasso objective, evaluating the objective at $\hat{\beta}$ and at $\beta$ gives \begin{align*} \frac{1}{n}|Y-X\hat{\beta}|^2 + 2\lambda\|\hat{\beta}\|_1 \leq \frac{1}{n}|Y-X\beta|^2 + 2\lambda\|\beta\|_1. \end{align*} Using $Y=X\beta^*+\varepsilon$, expanding both squared norms, and cancelling $\frac{1}{n}|\varepsilon|^2$, we obtain \begin{align*} \frac{1}{n}|X(\hat{\beta}-\beta^*)|^2 \leq \frac{1}{n}|X(\beta-\beta^*)|^2 + \frac{2}{n}\varepsilon^\top X(\hat{\beta}-\beta) + 2\lambda(\|\beta\|_1-\|\hat{\beta}\|_1). \end{align*} On $\mathcal{E}_\lambda$, the dual norm inequality gives \begin{align*} \left|\frac{2}{n}\varepsilon^\top X(\hat{\beta}-\beta)\right| = 2\left|\left(\frac{1}{n}X^\top\varepsilon\right)^\top d\right| \leq 2\left\|\frac{1}{n}X^\top\varepsilon\right\|_\infty \|d\|_1 \leq \lambda\|d\|_1. \end{align*} Therefore \begin{align*} \frac{1}{n}|X(\hat{\beta}-\beta^*)|^2 \leq \frac{1}{n}|X(\beta-\beta^*)|^2 + \lambda\|d\|_1 + 2\lambda(\|\beta\|_1-\|\hat{\beta}\|_1). \end{align*} [guided] We begin by comparing $\hat{\beta}$ with a completely arbitrary sparse vector $\beta$. This is the key difference between an estimation bound around $\beta^*$ and an oracle inequality: the vector $\beta$ is allowed to approximate $\beta^*$, so the final estimate will contain the approximation term \begin{align*} \frac{1}{n}|X(\beta-\beta^*)|^2. \end{align*} Define $d \in \mathbb{R}^p$ by \begin{align*} d := \hat{\beta}-\beta. \end{align*} Define $S \subseteq \{1,\dots,p\}$ by \begin{align*} S := \{j \in \{1,\dots,p\}: \beta_j \neq 0\}. \end{align*} Define $s \in \mathbb{N} \cup \{0\}$ by \begin{align*} s := |S|. \end{align*} Since $S$ is the support of $\beta$, we have $s=\|\beta\|_0 \leq m$. The theorem statement defines a Lasso solution as a minimizer of the map $Q: \mathbb{R}^p \to \mathbb{R}$ defined by \begin{align*} Q(b) := \frac{1}{n}|Y-Xb|^2 + 2\lambda\|b\|_1. \end{align*} Because $\hat{\beta}$ is a minimizer of $Q$, we have $Q(\hat{\beta}) \leq Q(\beta)$, namely \begin{align*} \frac{1}{n}|Y-X\hat{\beta}|^2 + 2\lambda\|\hat{\beta}\|_1 \leq \frac{1}{n}|Y-X\beta|^2 + 2\lambda\|\beta\|_1. \end{align*} Substitute $Y=X\beta^*+\varepsilon$. Then \begin{align*} Y-X\hat{\beta} = \varepsilon-X(\hat{\beta}-\beta^*), \qquad Y-X\beta = \varepsilon-X(\beta-\beta^*). \end{align*} Expanding the Euclidean squared norm with $h=\hat{\beta}-\beta^*$ gives \begin{align*} \frac{1}{n}|\varepsilon-Xh|^2 = \frac{1}{n}|\varepsilon|^2 - \frac{2}{n}\varepsilon^\top Xh + \frac{1}{n}|Xh|^2. \end{align*} Applying the same expansion with $h=\beta-\beta^*$ gives \begin{align*} \frac{1}{n}|\varepsilon-Xh|^2 = \frac{1}{n}|\varepsilon|^2 - \frac{2}{n}\varepsilon^\top Xh + \frac{1}{n}|Xh|^2. \end{align*} After cancelling the common term $\frac{1}{n}|\varepsilon|^2$, substituting $h=\hat{\beta}-\beta^*$ and $h=\beta-\beta^*$ back into the two displayed expansions, and moving the stochastic terms to the right-hand side, we obtain \begin{align*} \frac{1}{n}|X(\hat{\beta}-\beta^*)|^2 \leq \frac{1}{n}|X(\beta-\beta^*)|^2 + \frac{2}{n}\varepsilon^\top X(\hat{\beta}-\beta) + 2\lambda(\|\beta\|_1-\|\hat{\beta}\|_1). \end{align*} Now we use the event $\mathcal{E}_\lambda$ exactly where it is needed: it controls the random linear score. Since $d=\hat{\beta}-\beta$, we have \begin{align*} \frac{2}{n}\varepsilon^\top X(\hat{\beta}-\beta) = 2\left(\frac{1}{n}X^\top\varepsilon\right)^\top d. \end{align*} The [dual norm inequality](/page/Dual%20Norm) between $\ell^\infty$ and $\ell^1$ gives \begin{align*} \left|\frac{2}{n}\varepsilon^\top X(\hat{\beta}-\beta)\right| \leq 2\left\|\frac{1}{n}X^\top\varepsilon\right\|_\infty\|d\|_1. \end{align*} On $\mathcal{E}_\lambda$, the factor $\left\|\frac{1}{n}X^\top\varepsilon\right\|_\infty$ is at most $\lambda/2$, hence \begin{align*} \left|\frac{2}{n}\varepsilon^\top X(\hat{\beta}-\beta)\right| \leq \lambda\|d\|_1. \end{align*} Substituting this into the basic inequality yields \begin{align*} \frac{1}{n}|X(\hat{\beta}-\beta^*)|^2 \leq \frac{1}{n}|X(\beta-\beta^*)|^2 + \lambda\|d\|_1 + 2\lambda(\|\beta\|_1-\|\hat{\beta}\|_1). \end{align*} [/guided] [/step] [step:Decompose the penalty into active and inactive coordinates] Because $\beta_{S^c}=0$, we have $d_{S^c}=\hat{\beta}_{S^c}$. Also, \begin{align*} \|\hat{\beta}\|_1 = \|\beta_S+d_S\|_1+\|d_{S^c}\|_1. \end{align*} The [reverse triangle inequality](/page/Triangle%20Inequality) for the $\ell^1$ norm gives \begin{align*} \|\beta\|_1-\|\hat{\beta}\|_1 = \|\beta_S\|_1-\|\beta_S+d_S\|_1-\|d_{S^c}\|_1 \leq \|d_S\|_1-\|d_{S^c}\|_1. \end{align*} Since $\|d\|_1=\|d_S\|_1+\|d_{S^c}\|_1$, the previous step gives \begin{align*} \frac{1}{n}|X(\hat{\beta}-\beta^*)|^2 \leq \frac{1}{n}|X(\beta-\beta^*)|^2 + 3\lambda\|d_S\|_1 - \lambda\|d_{S^c}\|_1. \end{align*} [guided] The penalty term must be separated according to the support of the comparator $\beta$, because the compatibility condition only controls the coordinates in that support. Since $S=\{j:\beta_j\neq 0\}$, the restriction of $\beta$ to $S^c$ is zero, and therefore $d_{S^c}=\hat{\beta}_{S^c}$. Hence \begin{align*} \|\hat{\beta}\|_1 = \|\beta_S+d_S\|_1+ \|d_{S^c}\|_1. \end{align*} The [reverse triangle inequality](/page/Triangle%20Inequality) for the $\ell^1$ norm gives \begin{align*} \|\beta_S\|_1-\|\beta_S+d_S\|_1 \leq \|d_S\|_1. \end{align*} Using also $\|\beta\|_1=\|\beta_S\|_1$, we obtain \begin{align*} \|\beta\|_1-\|\hat{\beta}\|_1 = \|\beta_S\|_1-\|\beta_S+d_S\|_1-\|d_{S^c}\|_1 \leq \|d_S\|_1-\|d_{S^c}\|_1. \end{align*} The earlier basic inequality contains the two penalty contributions $\lambda\|d\|_1$ and $2\lambda(\|\beta\|_1-\|\hat{\beta}\|_1)$. Since $\|d\|_1=\|d_S\|_1+\|d_{S^c}\|_1$, substitution first gives \begin{align*} \lambda\|d\|_1+2\lambda(\|\beta\|_1-\|\hat{\beta}\|_1) \leq \lambda(\|d_S\|_1+\|d_{S^c}\|_1)+2\lambda(\|d_S\|_1-\|d_{S^c}\|_1). \end{align*} Collecting the active and inactive coordinate terms gives \begin{align*} \lambda(\|d_S\|_1+\|d_{S^c}\|_1)+2\lambda(\|d_S\|_1-\|d_{S^c}\|_1) = 3\lambda\|d_S\|_1-\lambda\|d_{S^c}\|_1. \end{align*} Therefore \begin{align*} \frac{1}{n}|X(\hat{\beta}-\beta^*)|^2 \leq \frac{1}{n}|X(\beta-\beta^*)|^2 + 3\lambda\|d_S\|_1 - \lambda\|d_{S^c}\|_1. \end{align*} This is the point where sparsity enters: the active coordinates produce a positive error term, while the inactive coordinates produce a negative term that can either settle the proof immediately or force a cone condition. [/guided] [/step] [step:Split according to whether the error lies in the compatibility cone] If \begin{align*} \|d_{S^c}\|_1 > 3\|d_S\|_1, \end{align*} then \begin{align*} 3\lambda\|d_S\|_1-\lambda\|d_{S^c}\|_1 < 0, \end{align*} and hence \begin{align*} \frac{1}{n}|X(\hat{\beta}-\beta^*)|^2 \leq \frac{1}{n}|X(\beta-\beta^*)|^2. \end{align*} This is already stronger than the desired estimate. It remains to consider the case \begin{align*} \|d_{S^c}\|_1 \leq 3\|d_S\|_1. \end{align*} If $s=0$, then $S=\varnothing$, so $d_S=0$ and the cone condition forces $d_{S^c}=0$, hence $d=0$ and the desired estimate follows. Assume therefore that $1 \leq s \leq m$. [guided] From the penalty decomposition we have \begin{align*} \frac{1}{n}|X(\hat{\beta}-\beta^*)|^2 \leq \frac{1}{n}|X(\beta-\beta^*)|^2 + 3\lambda\|d_S\|_1 - \lambda\|d_{S^c}\|_1. \end{align*} There are two possibilities. If \begin{align*} \|d_{S^c}\|_1 > 3\|d_S\|_1, \end{align*} then the penalty remainder is strictly negative: \begin{align*} 3\lambda\|d_S\|_1-\lambda\|d_{S^c}\|_1 < 0. \end{align*} Consequently \begin{align*} \frac{1}{n}|X(\hat{\beta}-\beta^*)|^2 \leq \frac{1}{n}|X(\beta-\beta^*)|^2, \end{align*} which is stronger than the desired oracle bound because the additional term $\lambda^2\|\beta\|_0/\phi^2$ is nonnegative. The only remaining case is therefore \begin{align*} \|d_{S^c}\|_1 \leq 3\|d_S\|_1. \end{align*} This is exactly the compatibility cone. If $s=0$, then $S=\varnothing$, so $d_S=0$. The cone inequality then gives $\|d_{S^c}\|_1\leq 0$, hence $d_{S^c}=0$ and $d=0$. Thus the proof is complete in the zero-support case. We may therefore assume $1\leq s\leq m$, which is precisely the range in which the uniform compatibility hypothesis can be applied to $S$. [/guided] [/step] [step:Use compatibility to convert active \ell^1 error into prediction error] Since $1 \leq s \leq m$ and $d$ satisfies the cone condition \begin{align*} \|d_{S^c}\|_1 \leq 3\|d_S\|_1, \end{align*} we are in the cone convention used in the [compatibility condition](/page/Compatibility%20Condition), namely cone constant $3$. Applying that condition to the set $S$ and the vector $d$ gives \begin{align*} \frac{1}{n}|Xd|^2 \geq \frac{\phi^2}{s}\|d_S\|_1^2. \end{align*} Therefore \begin{align*} \|d_S\|_1 \leq \frac{\sqrt{s}}{\phi}\frac{|Xd|}{\sqrt{n}}. \end{align*} Substituting this into the penalty bound and dropping the nonpositive term $-\lambda\|d_{S^c}\|_1$ gives \begin{align*} \frac{1}{n}|X(\hat{\beta}-\beta^*)|^2 \leq \frac{1}{n}|X(\beta-\beta^*)|^2 + \frac{3\lambda\sqrt{s}}{\phi}\frac{|Xd|}{\sqrt{n}}. \end{align*} [guided] In the remaining case the error vector $d$ lies in the cone \begin{align*} \|d_{S^c}\|_1 \leq 3\|d_S\|_1. \end{align*} The set $S$ has size $s$ with $1\leq s\leq m$, so it is one of the supports covered by the uniform compatibility assumption. We use the cone-constant-$3$ convention in the [compatibility condition](/page/Compatibility%20Condition): it applies to vectors satisfying $\|d_{S^c}\|_1\leq 3\|d_S\|_1$. Since this is exactly the cone inequality above, applying the compatibility condition to this support $S$ and this vector $d$ gives \begin{align*} \frac{1}{n}|Xd|^2 \geq \frac{\phi^2}{s}\|d_S\|_1^2. \end{align*} Taking square roots is valid because both sides are nonnegative, and it gives \begin{align*} \|d_S\|_1 \leq \frac{\sqrt{s}}{\phi}\frac{|Xd|}{\sqrt{n}}. \end{align*} We substitute this estimate into the penalty-decomposed inequality \begin{align*} \frac{1}{n}|X(\hat{\beta}-\beta^*)|^2 \leq \frac{1}{n}|X(\beta-\beta^*)|^2 + 3\lambda\|d_S\|_1 - \lambda\|d_{S^c}\|_1. \end{align*} The final term $-\lambda\|d_{S^c}\|_1$ is nonpositive because $\lambda\geq 0$ and $\|d_{S^c}\|_1\geq 0$, so dropping it only weakens the inequality. Therefore \begin{align*} \frac{1}{n}|X(\hat{\beta}-\beta^*)|^2 \leq \frac{1}{n}|X(\beta-\beta^*)|^2 + \frac{3\lambda\sqrt{s}}{\phi}\frac{|Xd|}{\sqrt{n}}. \end{align*} This is the bridge from sparse geometry to prediction error: the active-coordinate $\ell^1$ term has been replaced by the prediction norm of $Xd$. [/guided] [/step] [step:Absorb the remaining prediction term] Define $A \in [0,\infty)$ by \begin{align*} A := \frac{1}{n}|X(\hat{\beta}-\beta^*)|^2. \end{align*} Define $B \in [0,\infty)$ by \begin{align*} B := \frac{1}{n}|X(\beta-\beta^*)|^2. \end{align*} Define $a \in [0,\infty)$ by \begin{align*} a := \frac{3\lambda\sqrt{s}}{\phi}. \end{align*} Since \begin{align*} Xd = X(\hat{\beta}-\beta)=X(\hat{\beta}-\beta^*)-X(\beta-\beta^*), \end{align*} the [Euclidean triangle inequality](/page/Triangle%20Inequality) gives \begin{align*} \frac{|Xd|}{\sqrt{n}} \leq \sqrt{A}+\sqrt{B}. \end{align*} Thus \begin{align*} A \leq B+a\sqrt{A}+a\sqrt{B}. \end{align*} Using [Young's inequality](/page/Young%27s%20Inequality) in the elementary form $uv \leq \frac{1}{2}u^2+\frac{1}{2}v^2$ with $(u,v)=(\sqrt{A},a)$ gives \begin{align*} a\sqrt{A} \leq \frac{1}{2}A+\frac{1}{2}a^2. \end{align*} Using the same inequality with $(u,v)=(\sqrt{B},a)$ gives \begin{align*} a\sqrt{B} \leq \frac{1}{2}B+\frac{1}{2}a^2. \end{align*} Consequently, \begin{align*} A \leq B+\frac{1}{2}A+\frac{1}{2}a^2+\frac{1}{2}B+\frac{1}{2}a^2 = \frac{1}{2}A+\frac{3}{2}B+a^2. \end{align*} Subtracting $\frac{1}{2}A$ from both sides yields \begin{align*} A \leq 3B+2a^2. \end{align*} Since $a^2=9\lambda^2s/\phi^2$, we obtain \begin{align*} \frac{1}{n}|X(\hat{\beta}-\beta^*)|^2 \leq 3\frac{1}{n}|X(\beta-\beta^*)|^2 + 18\frac{\lambda^2 s}{\phi^2}. \end{align*} Because $s=\|\beta\|_0$, this implies \begin{align*} \frac{1}{n}|X(\hat{\beta}-\beta^*)|^2 \leq 18\left\{ \frac{1}{n}|X(\beta-\beta^*)|^2 + \frac{\lambda^2\|\beta\|_0}{\phi^2} \right\}. \end{align*} [guided] The compatibility step left us with the estimate \begin{align*} \frac{1}{n}|X(\hat{\beta}-\beta^*)|^2 \leq \frac{1}{n}|X(\beta-\beta^*)|^2 + \frac{3\lambda\sqrt{s}}{\phi}\frac{|Xd|}{\sqrt{n}}. \end{align*} We now convert the remaining $|Xd|$ term into the two prediction errors already present in the theorem. Define $A \in [0,\infty)$ by \begin{align*} A := \frac{1}{n}|X(\hat{\beta}-\beta^*)|^2. \end{align*} Define $B \in [0,\infty)$ by \begin{align*} B := \frac{1}{n}|X(\beta-\beta^*)|^2. \end{align*} Define $a \in [0,\infty)$ by \begin{align*} a := \frac{3\lambda\sqrt{s}}{\phi}. \end{align*} Since \begin{align*} Xd=X(\hat{\beta}-\beta)=X(\hat{\beta}-\beta^*)-X(\beta-\beta^*), \end{align*} the [Euclidean triangle inequality](/page/Triangle%20Inequality) gives \begin{align*} \frac{|Xd|}{\sqrt{n}} \leq \sqrt{A}+\sqrt{B}. \end{align*} Therefore \begin{align*} A \leq B+a\sqrt{A}+a\sqrt{B}. \end{align*} We apply [Young's inequality](/page/Young%27s%20Inequality) in the form $uv\leq \frac{1}{2}u^2+\frac{1}{2}v^2$. With $(u,v)=(\sqrt{A},a)$, \begin{align*} a\sqrt{A} \leq \frac{1}{2}A+\frac{1}{2}a^2. \end{align*} With $(u,v)=(\sqrt{B},a)$, \begin{align*} a\sqrt{B} \leq \frac{1}{2}B+\frac{1}{2}a^2. \end{align*} Substituting both bounds gives \begin{align*} A \leq B+\frac{1}{2}A+\frac{1}{2}a^2+\frac{1}{2}B+\frac{1}{2}a^2 = \frac{1}{2}A+\frac{3}{2}B+a^2. \end{align*} Subtracting $\frac{1}{2}A$ from both sides yields \begin{align*} A\leq 3B+2a^2. \end{align*} Because $a^2=9\lambda^2s/\phi^2$ and $s=\|\beta\|_0$, this becomes \begin{align*} \frac{1}{n}|X(\hat{\beta}-\beta^*)|^2 \leq 3\frac{1}{n}|X(\beta-\beta^*)|^2 + 18\frac{\lambda^2\|\beta\|_0}{\phi^2}. \end{align*} Since $3\leq 18$, we may use the single universal constant $18$ on both terms: \begin{align*} \frac{1}{n}|X(\hat{\beta}-\beta^*)|^2 \leq 18\left\{ \frac{1}{n}|X(\beta-\beta^*)|^2 + \frac{\lambda^2\|\beta\|_0}{\phi^2} \right\}. \end{align*} [/guided] [/step] [step:Take the infimum over all sparse comparators] The preceding estimate holds on $\mathcal{E}_\lambda$ for every $\beta \in \mathbb{R}^p$ satisfying $\|\beta\|_0 \leq m$. Therefore taking the infimum over all such $\beta$ gives \begin{align*} \frac{1}{n}|X(\hat{\beta}-\beta^*)|^2 \leq 18 \inf_{\beta \in \mathbb{R}^p,\ \|\beta\|_0 \leq m} \left\{ \frac{1}{n}|X(\beta-\beta^*)|^2 + \frac{\lambda^2\|\beta\|_0}{\phi^2} \right\}. \end{align*} This is the claimed oracle inequality. [guided] The estimate just proved was derived after fixing an arbitrary comparator $\beta\in\mathbb{R}^p$ with $\|\beta\|_0\leq m$. No later constant depends on this particular comparator, so the same bound holds simultaneously for every such $\beta$ on the event $\mathcal{E}_\lambda$: \begin{align*} \frac{1}{n}|X(\hat{\beta}-\beta^*)|^2 \leq 18\left\{ \frac{1}{n}|X(\beta-\beta^*)|^2 + \frac{\lambda^2\|\beta\|_0}{\phi^2} \right\}. \end{align*} Taking the infimum over the set of all sparse comparators $\{\beta\in\mathbb{R}^p:\|\beta\|_0\leq m\}$ gives \begin{align*} \frac{1}{n}|X(\hat{\beta}-\beta^*)|^2 \leq 18 \inf_{\beta \in \mathbb{R}^p,\ \|\beta\|_0 \leq m} \left\{ \frac{1}{n}|X(\beta-\beta^*)|^2 + \frac{\lambda^2\|\beta\|_0}{\phi^2} \right\}. \end{align*} This is the oracle inequality with the universal constant $C=18$. [/guided] [/step]

Prerequisites (0/1 completed)

Prerequisites Graph

Interactive dependency map showing how this theorem builds on foundational concepts

Loading dependency graph...

Definitions & Concepts

Event

What brings you to Androma?

Start with a route through the knowledge graph.